1 Introduction

2 Preliminaries

In this preliminary section, we’ll cover basic information that will help you to get started with RStudio.

2.1 R and RStudio Installation

If you haven’t already, please go ahead and install both the R and RStudio applications. R and RStudio must be installed separately; you should install R first, and then RStudio. The R application is a bare-bones computing environment that supports statistical computing using the R programming language; RStudio is a visually appealing, feature-rich, and user-friendly interface that allows users to interact with this environment in an intuitive way. Once you have both applications installed, you don’t need to open up R and RStudio separately; you only need to open and interact with RStudio (which will run R in the background).

The following subsections provide instructions on installing R and RStudio for the macOS and Windows operating systems. These instructions are taken from the “Setup” section of the Data Carpentry Course entitled R for Social Scientists. The Data Carpentry page also contains installation instructions for the Linux operating system; if you’re a Linux user, please refer to that page for instructions.

The Appendix to Garrett Grolemund’s book Hands-On Programming with R also provides an excellent overview of the R and RStudio installation process.

2.1.1 Windows Installation Instructions

  • Download R from the CRAN website
  • Run the .exe file that was just downloaded.
  • Go to the RStudio download page and under Installers select the “Windows” option.
  • Double click the file to install RStudio
  • Open RStudio to make sure it works.

2.1.2 macOS Installation Instructions

  • Download R from the CRAN website
  • Select the .pkg file for the latest R version.
  • Double click on the downloaded file to install R.
  • It is also a good idea to install XQuartz, which some packages require.
  • Go to the RStudio download page, and under Installers select the “macOS” option.
  • Double click the file to install RStudio
  • Open RStudio to make sure it works.

2.2 The RStudio Interface

Now that we’ve installed and opened up RStudio, let’s familiarize ourselves with the RStudio interface. When we open up RStudio, we’ll see a window that looks something like this:

The RStudio Interface

If your interface doesn’t look exactly like this, it shouldn’t be a problem; we would expect to see minor cosmetic differences in the appearance of the interface across operating systems and computers (based on how they’re configured). However, you should see four distinct windows within the larger RStudio interface:

  • The top-left window is known as the Source window.
    • The Source window is where we can write our R scripts (including the code associated with this tutorial) and execute those scripts. We can also type R code into the “Console” window (the bottom-left window), but it is preferable to write our code in a script within the Source window. That’s because scripts can be saved (while code typed into the console cannot); writing scripts therefore allows us to keep track of what we’re doing, and facilitates the reproducibility of our work. Note that in some cases, we may not see a Source window when we first open RStudio. In that case, to start a new script, simply click File on the RStudio menu bar, scroll down to New File, and then select R Script from the submenu that opens up.
    • It’s also worth noting that the outputs of certain functions will appear in the Source window. In the context of our tutorial, when we want to view our datasets, we will use the View() function, which will display the relevant data within a new tab in the Source window.
  • The top-right window is the Environment/History pane of the RStudio interface.
    • The “Environment” tab of this window provides information on the datasets we’ve loaded into RStudio, as well as objects we have defined (we’ll talk about objects more later in the tutorial).
    • The “History” tab of the window provides a record of the R commands we’ve run in a given session.
  • The bottom-right window is the Files/Plots/Packages/Help/Viewer window.
    • The “Files” tab displays our computer’s directories and file structures and allows us to navigate through them without having to leave the R environment.
    • The “Plots” tab is the tab where we can view any visualizations that we create. Within the “Plots” tab, make note of the “Zoom” button, which we can use to enlarge the display of our visualizations if they’re too compressed in the “Plots” window. Also, note the “Export” button within the “Plots” tab (next to the “Zoom” button); we can use this button to export the displayed visualization to a .png or .jpeg file that can be used outside of RStudio.
    • The “Packages” tab provides information on which packages have been installed, as well as which packages are currently loaded (more on packages in Sections 2.3 and 2.4 below).
    • The “Help” tab displays documentation for R packages and functions. If we want to know more about how a package or function works, we can simply type a “?” followed by the package or function’s name (no space between the question mark and the name) and relevant information will be displayed within the “Help” tab.
    • The “Viewer” tab displays HTML output. If we write code that generates an HTML file, we can view it within the “Viewer” tab.
  • The bottom-left window is the Console/Terminal/Jobs window.
    • The “Console” tab is where we can see our code execute when we run our scripts, as well as certain outputs produced by those scripts. In addition, if there are any error or warning messages, they will be printed to the “Console” tab. We can also type code directly into the console, but as we noted earlier, it is better practice to write our code in a script and then run it from there.
    • The “Terminal” and “Jobs” tabs are not relevant for our workshop. We’ll briefly provide an overview of “R Markdown” towards the end of the lesson.

2.3 Install Packages

R is an open-source programming language for statistical computing that allows users to carry out a wide range of data analysis and visualization tasks (among other things). One of the big advantages of using R is that it has a very large user community among social scientists, statisticians, and digital humanists, who frequently publish R packages. One might think of packages as workbooks of sorts, which contain a well-integrated set of R functions, scripts, data, and documentation; these “workbooks” are designed to facilitate certain tasks or implement useful procedures. These packages are then shared with the broader R user community, and at this point, anyone who needs to accomplish the tasks to which the package addresses itself can use the package in the context of their own projects. The ability to use published packages considerably simplifies the work of applied data research using R; it means that we rarely have to write code entirely from scratch, and can build on the code that others have published in the form of packages. This allows applied researchers to focus on substantive problems, without having to get too bogged down in complicated programming tasks.

In this workshop, we will use the following packages to carry out relevant data analysis and visualization tasks (please click the relevant link to learn more about a given package; note that the tidyverse is not a single package, but rather an entire suite of packages used for common data science and analysis tasks):

  • tidyverse
  • wosr

To install a package in R, we can use the install.packages() function. A function is essentially a programming construct that takes a specified input (called an “argument”), runs this input through a set of procedures, and returns an output. In the code block below, the name of the package we want to install (here, the tidyverse suite) is enclosed within quotation marks and placed within parentheses after install.packages. Running the code below will download the tidyverse suite of packages to our computer:

# Installs the tidyverse suite of packages
install.packages("tidyverse")

To run this code in your own R session:

  • First, copy the code from the codeblock above (you can copy the code to your clipboard by hovering over the top-right of the code-block and clicking the “copy” icon; you can also highlight the code and copy from the Edit menu of your browser).
  • Then, start a new R script within RStudio; if you want to keep a future record of your work, you may want to save this script to your computer (perhaps in the same folder to which you downloaded the tutorial data). We can save our scripts via the RStudio “File” menu.
  • Paste the code into the script, highlight it, and click the “Run” button that is just above the Source window.
  • Alternatively, instead of copying/pasting, you can manually type in the code from the codeblock into your script (manually typing in the code is slower, but often a better way to learn than copy/pasting).
  • After you’ve run the code, watch the code execute in the console, and look for a message confirming that the package has been successfully installed.

Below, we can see how that line of code should look in your script, and how to run it:

Installing tidyverse in R Script

Please note that you can follow along with the tutorial on your own computers by transferring all of the subsequent codeblocks into your script in just this way. Run each codeblock in your RStudio environment as you go, and you should be able to replicate the entire tutorial on your computer. You can copy-paste the workshop code if you wish, but we recommend actually retyping the code into your script, since this will help you to more effectively familiarize yourself with the process of writing code in R.

Note that the codeblocks in the tutorial usually have a comment, prefaced by a hash (“#”). When writing code in R (or any other programming language), it is good practice to preface one’s code with brief comments that describe what a block of code is doing. Writing these comments allows someone else (or your future self) to read and understand the code more quickly than might otherwise be the case. The hash before the comment tells R that the subsequent text on that line is a comment, and should be ignored when running a script. If one does not preface the comment with a hash, R will try to run the comment as code, which will typically produce an error message.

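Now, let’s install the other packages we’ll use in this workshop, using the same install.packages() function:

# Installs the remaining packages that are loaded in Section 2.4
install.packages(c("wosr", "psych", "fastDummies", "janitor", "tidytext", "wordcloud2"))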

All of the packages we need are now installed!

2.4 Load libraries

However, while our packages are installed, they are not yet ready to use. Before we can use our packages, we must load them into our environment. We can think of the process of loading installed packages into a current R environment as analogous to opening up an application on your phone or computer after it has been installed (even after an application has been installed, you can’t use it until you open it!). To load (i.e. “open”) an R package, we pass the name of the package we want to load as an argument to the library() function. For example, if we want to load our tidyverse packages into the current environment, we can type:

# Loads tidyverse packages into memory
library(tidyverse)

At this point, the full functionality of the tidyverse suite is available for us to use.

Now, let’s go ahead and load the remainder of the packages that we’ll need:

# loads remainder of required packages
library(wosr)
library(psych)
library(fastDummies)
library(janitor)
library(tidytext)
library(wordcloud2)

At this point, the packages are loaded and ready to go! One important thing to note regarding the installation and loading of packages is that we only have to install packages once; after a package is installed, there is no need to subsequently reinstall it. However, we must load the packages we need (using the library() function) every time we open a new R session. In other words, if we were to close RStudio at this point and open it up later, we would not need to install these packages again, but would need to load them again.
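Because installation only needs to happen once, it can be handy to check which packages are already present before installing anything. The following is a minimal sketch, not part of the original workshop code; the object names required_packages, is_installed, and missing_packages are our own illustrative choices, and the package list is simply copied from the library() calls above.

```r
# names of the packages loaded in this workshop (copied from the library() calls above)
required_packages <- c("tidyverse", "wosr", "psych", "fastDummies",
                       "janitor", "tidytext", "wordcloud2")

# checks, for each package, whether it is already installed on this computer
is_installed <- vapply(required_packages, requireNamespace,
                       FUN.VALUE = logical(1), quietly = TRUE)

# keeps only the names of packages that are not yet installed
missing_packages <- required_packages[!is_installed]

# prints the names of any missing packages
missing_packages

# installs only the missing packages (uncomment the line below to run it)
# install.packages(missing_packages)
```

If missing_packages is empty, everything is already installed, and only the library() calls need to be run in each new session.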

3 Part 1: Foundations for Data Analysis in R

Before we can get a sense of how to work with data in R, it is important to familiarize ourselves with basic features of the R language’s syntax, and the basic data structures that are used to store and process data.

3.1 R as a Calculator

At its most basic, R can be used as a calculator. For instance:

# calculates 2+2
2+2
## [1] 4
# calculates 65 to the power of 4
65^4
## [1] 17850625
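The same approach works for the other standard arithmetic operators. The short sketch below is our own illustration (it is not part of the original workshop code); it shows that R follows the usual order of operations, and that built-in functions such as sqrt() can be used alongside the operators:

```r
# multiplication is evaluated before addition
2 + 3 * 4
## [1] 14
# parentheses change the order of evaluation
(2 + 3) * 4
## [1] 20
# division
10 / 4
## [1] 2.5
# integer division, followed by the remainder after division
17 %/% 5
## [1] 3
17 %% 5
## [1] 2
# square root, using the built-in sqrt() function
sqrt(16)
## [1] 4
```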

While this is a useful starting point, the possibility of assigning values to objects (or variables) considerably increases the scope of the operations we are able to carry out. We turn to object assignment in the next sub-section.

3.2 Object assignment and manipulation

The concept of object (or variable) assignment is a fundamental concept when working in a scripting environment; indeed, the ability to easily assign values to objects is what allows us to easily and intuitively manipulate and process our data in a programmatic setting. To better understand the mechanics of object assignment, consider the following:

# assign value 5 to new object named x
x<-5

In the code above, we use R’s assignment operator, <-, to assign the value 5 to an object named x. Now that an object named x has been created and assigned the value 5, typing x in our console (or typing x in our script and running it) will print the value that has been assigned to the x object, i.e. 5:

# prints value assigned to "x"
x
## [1] 5

More generally, the process of assignment stores the output created by the code on the right side of the assignment operator (<-) in an object whose name is specified on the left side of the assignment operator. Whenever we want to look at the contents of an object (i.e. the output created by the code on the right side of the assignment operator), we simply type the name of the object in the R console (or type the name in a script and run it).

Let’s create another object, named y, and assign it the value “12”:

# assign value 12 to new object named y
y<-12

As we noted above, we can print the value that was assigned to y by printing its name:

# prints value assigned to "y"
y
## [1] 12

It’s possible to use existing objects to assign values to new ones. For example, we can assign the sum of x and y to a new object that we’ll name xy_sum:

# creates a new object, named "xy_sum" whose value is the sum of "x" and "y"
xy_sum<-x+y

Now, let’s print the contents of xy_sum:

# prints contents of "xy_sum"
xy_sum
## [1] 17

As expected, we see that the value assigned to xy_sum is “17” (i.e. the sum of the values assigned to x and y).

It is possible to change the value assigned to a given object. For example, let’s say we want to change the value assigned to x from “5” to “8”:

# assign value of "8" to object named "x"
x<-8

We can confirm that x is now associated with the value “8”:

# prints updated value of "x"
x
## [1] 8

It’s worth noting that updating the value assigned to x will not automatically update the value assigned to xy_sum (which, recall, is the sum of x and y). If we print the value assigned to xy_sum, note that it is still “17”:

# prints value of "xy_sum", which still reflects the old value of "x"
xy_sum
## [1] 17

In order for the value assigned to xy_sum to be updated with the new value of x, we must run the assignment operation again:

# assigns sum of "y" and newly updated value of "x" to "xy_sum" object
xy_sum<-x+y

Now, the value of xy_sum should reflect the updated value of x, which we can confirm by printing the value of xy_sum:

# prints value of "xy_sum"
xy_sum
## [1] 20

Note that the value assigned to xy_sum is now “20” (the sum of “8” and “12”), rather than “17” (the sum of “5” and “12”).

While the examples above were very simple, we can assign virtually any R code, and by extension, the data structure(s) generated by that code (such as datasets, vectors, graphs/plots etc.) to an R object. When naming your objects, try to be descriptive, so that the name of the object signifies something about its corresponding value.

Below, consider a simple example of an object, named our_location, that has been assigned a non-numeric value. Its value is a string, or textual information:

# assigns text string "Boulder, CO" to new object named "our_location"
our_location<-"Boulder, CO"

We can print the string that has been assigned to the our_location object by typing the name of the object in our console, or running it from our script:

# prints value of "our_location" object
our_location
## [1] "Boulder, CO"

Note that, generally speaking, you have a lot of flexibility in naming your R objects, but there are certain rules. For example, object names must start with a letter (or with a period that is not followed by a number), and cannot contain any special symbols (they can only contain letters, numbers, underscores, and periods). Also, object names cannot contain spaces; if you’d like to use multiple words or phrases, connect the discrete elements with an underscore (_), or use camel case (where words are distinguished by beginning each discrete word with a capital letter).
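To make these naming rules concrete, here is a small sketch of our own (the object names are purely illustrative and are not used elsewhere in the workshop). The invalid names are commented out, since running them would produce an error. The last lines use make.names(), a base R function that converts candidate names into syntactically valid ones, to show how R would repair them:

```r
# valid names: each starts with a letter and contains only letters, numbers, underscores, and periods
survey_count <- 10    # words joined with underscores
surveyCount <- 10     # camel case
survey.count.2 <- 10  # periods and numbers are allowed after the first letter

# invalid names (commented out because each would produce an error)
# 2nd_survey <- 10      # starts with a number
# survey count <- 10    # contains a space
# survey-count <- 10    # contains a hyphen, which R reads as subtraction

# asks R to convert three candidate names into valid object names
repaired_names <- make.names(c("survey_count", "2nd_survey", "survey count"))

# prints the repaired names; the first is returned unchanged because it was already valid
repaired_names
```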

It is also worth emphasizing that object names are case sensitive; in order to print the value assigned to an object, that object’s name must be typed exactly as it was created. For example, if we were to type our_Location, we would get an error, since there is no our_Location object (only an our_location object):

our_Location
## Error in eval(expr, envir, enclos): object 'our_Location' not found

3.3 Data structures

We now turn to a brief overview of some important data structures that help us to work with data in R. We will consider three data structures that are particularly useful: vectors, data frames, and lists. Note that this is not an exhaustive treatment of data structures in R; there are other structures, such as matrices and arrays, that are also important. However, we will limit our discussion to the data structures that are essential for getting started with data-based research in R.

3.3.1 Vectors

In R, a vector is a sequence of values. A vector is created using the c() function. For example, let’s make a vector with some arbitrary numeric values:

# makes vector with values 5,7,55,32
c(5, 7, 55, 32)
## [1]  5  7 55 32

If we plan to work with a numeric vector again later in our workflow, it makes sense to assign it to an object. Below, we assign a similar vector (this time with two decimal values) to an object that we’ll call arbitrary_values:

# assigns vector of arbitrary values to new object named "arbitrary_values"
arbitrary_values<-c(5,7,55.6,32.5)

Now, whenever we want to print the vector assigned to the arbitrary_values object, we can simply print the name of the object:

# prints vector assigned to "arbitrary_values" object
arbitrary_values
## [1]  5.0  7.0 55.6 32.5

It is possible to carry out mathematical operations with numeric vectors; for instance, let’s say that we want to double the values in the arbitrary_values vector; to do so, we can simply multiply arbitrary_values by 2, which yields a new vector where each numeric element is twice the corresponding element in arbitrary_values. Below, we’ll create a new vector that doubles the values in arbitrary_values, assign it to a new object named arbitrary_values_2x, and print the contents of arbitrary_values_2x:

# creates a new vector that doubles the values in "arbitrary_values" and assigns it to a new object named "arbitrary_values_2x"
arbitrary_values_2x<-arbitrary_values*2

# prints contents of "arbitrary_values_2x"
arbitrary_values_2x
## [1]  10.0  14.0 111.2  65.0

Now, let’s say we want to add different vectors together; the code below creates a new vector by adding together arbitrary_values and arbitrary_values_2x:

# adds "arbitrary_values" vector and "arbitrary_values_2x" vector
arbitrary_values + arbitrary_values_2x
## [1]  15.0  21.0 166.8  97.5

Note that each element of the resulting vector printed above is the sum of the corresponding elements in arbitrary_values and arbitrary_values_2x.

Other arithmetic operations on numeric vectors are also possible, and you may wish to explore these on your own as an exercise.
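As a starting point for that exploration, here is a brief sketch of our own (not part of the original workshop code). It re-creates the arbitrary_values vector so that it can be run on its own, and then applies subtraction, division, and a few built-in summary functions:

```r
# re-creates the "arbitrary_values" vector used above
arbitrary_values <- c(5, 7, 55.6, 32.5)

# subtracts 1 from every element
arbitrary_values - 1
## [1]  4.0  6.0 54.6 31.5
# divides every element by 2
arbitrary_values / 2
## [1]  2.50  3.50 27.80 16.25
# adds up all of the elements
sum(arbitrary_values)
## [1] 100.1
# calculates the average of the elements
mean(arbitrary_values)
## [1] 25.025
# counts the number of elements
length(arbitrary_values)
## [1] 4
```

Note that the arithmetic operators return a vector of the same length as the original, while sum(), mean(), and length() each return a single value.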

In many cases, it is useful to extract a specific element from a vector. Each element in a given vector is assigned an index number, starting with 1; that is, the first element in a vector is assigned an index value of 1, the second element of a vector is assigned an index value of 2, and so on. We can use these index values to extract our desired vector elements. In particular, we can specify the desired index within square brackets after printing the name of the vector object of interest. For example, let’s say we want to extract the 3rd element of the vector in arbitrary_values. We can do so with the following:

# extracts third element of "arbitrary_values" vector
arbitrary_values[3]
## [1] 55.6

It is also possible to extract a range of values from a vector using index values. For example, let’s say we want to extract a new vector comprised of the second, third, and fourth numeric elements in arbitrary_values; we can do so with the following:

# extracts a new vector comprised of the 2nd, 3rd, and 4th elements of the existing "arbitrary_values" vector
arbitrary_values[2:4]
## [1]  7.0 55.6 32.5

Thus far, we have been working with numeric vectors, where each of the vector’s elements is a numeric value, but it is also possible to create vectors in which the elements are strings (i.e. text). Such vectors are known as character vectors. For example, the code below creates a character vector of the first four months of the year, and assigns it to a new object named months_four:

# creates character vector whose elements are the first four months of the year, and assigns the vector to a new object named "months_four"
months_four<-c("January", "February", "March", "April")

Let’s now print the character vector assigned to months_four:

# prints contents of "months_four"
months_four
## [1] "January"  "February" "March"    "April"

We can extract elements from character vectors using index values in the same way we did so for elements in a numeric vector. For example:

# extracts second element of "months_four" object (i.e. the "February" string)
months_four[2]
## [1] "February"
# subsets the second and third elements of "months_four" object (i.e. the "February" and "March" strings, which are extracted as a new character vector)
months_four[2:3]
## [1] "February" "March"
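The square-bracket syntax shown above is more flexible than single positions and ranges. The sketch below is our own addition (not part of the original workshop code); it re-creates the two vectors so that it can be run on its own, and illustrates three further patterns: a vector of non-adjacent positions, a negative index (which drops an element), and a logical condition.

```r
# re-creates the vectors used above
arbitrary_values <- c(5, 7, 55.6, 32.5)
months_four <- c("January", "February", "March", "April")

# extracts the first and fourth elements (non-adjacent positions)
arbitrary_values[c(1, 4)]
## [1]  5.0 32.5
# a negative index drops the element at that position (here, the first month)
months_four[-1]
## [1] "February" "March"    "April"
# extracts only the elements that are greater than 10
arbitrary_values[arbitrary_values > 10]
## [1] 55.6 32.5
```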

3.3.2 Data frames

The data frame structure is the workhorse of data analysis in R. A data frame resembles a table, of the sort you might generate in a spreadsheet application.

Often, the most important (and arduous) step in a data analysis workflow is to assemble disparate strands of data into a tractable data frame. What does it mean for a data frame to be “tractable”? One way to define this concept more precisely is to appeal to the concept of “tidy” data, which is often referenced in the data science world. Broadly speaking, a “tidy” data frame is a table in which:

  1. Each variable has its own column
  2. Each observation has its own row
  3. Each value has its own cell

We will work extensively with data frames later in the workshop, but let’s generate a simple data frame from scratch, and assign it to a new object. We will generate a data frame containing “dummy” country-level data on basic economic, geographic, and demographic variables, and assign it to a new object named country_df. The data frame is created through the use of the data.frame() function, which is built into R. Column names and the corresponding column values are passed to the data.frame() function in the manner below, and the function effectively binds these different columns together into a table:

# Creates a dummy country-level data frame 
country_df<-data.frame(Country=c("Country A", "Country B", "Country C"),
                       GDP=c(8000, 30000, 23500),
                       Population=c(2000, 5400, 10000),
                       Continent=c("South America", "Europe", "North America"))

To observe the structure of the table, we can print it to the R console by simply printing the name of the object to which it has been assigned:

# prints "country_df" data frame to console
country_df
##     Country   GDP Population     Continent
## 1 Country A  8000       2000 South America
## 2 Country B 30000       5400        Europe
## 3 Country C 23500      10000 North America

One nice feature of RStudio is that instead of simply printing our data frames into the console, we can view a nicely formatted version of our data frame by passing the name of the data frame object to the View() function. For example, the code below will bring up the country_df data frame as a new tab in RStudio:

# pulls up "country_df" data frame in RStudio data viewer
View(country_df)

Note the “tidy” features of this simple data frame:

  1. Each of the variables (i.e. GDP, Population, Continent) has its own column
  2. Each of the (country-level) observations has its own row
  3. Each of the values (i.e. country-level information about a given variable) has its own distinct cell
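Because each variable has its own column and each observation has its own row, a few base R functions are enough to check the shape of a data frame. The following sketch is our own brief preview (not part of the original workshop code); it re-creates country_df so that it can be run on its own, and uses the $ operator to pull out a single column as a vector:

```r
# re-creates the dummy country-level data frame used above
country_df <- data.frame(Country = c("Country A", "Country B", "Country C"),
                         GDP = c(8000, 30000, 23500),
                         Population = c(2000, 5400, 10000),
                         Continent = c("South America", "Europe", "North America"))

# counts the rows (observations)
nrow(country_df)
## [1] 3
# counts the columns (variables)
ncol(country_df)
## [1] 4
# prints the column names
names(country_df)
## [1] "Country"    "GDP"        "Population" "Continent"
# extracts the "GDP" column as a numeric vector
country_df$GDP
## [1]  8000 30000 23500
```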

We will explore data frames, and the process of extracting information from them, at greater length in subsequent sections.

3.3.3 Lists

In R, a list is a data structure that allows us to conveniently store a variety of different objects, of various types. For example, we can use a list to store vectors, data frames, visualizations and graphs; basically any R object you can think of! It is also possible to store a list within a list.

Lists allow us to keep track of the various objects we create, and are therefore a useful data management tool. In addition, lists are very helpful to use when we want to perform iterative operations across multiple objects.

We can create lists in R using the list() function; the arguments to this function are the objects that we want to include in the list. In the code below, we’ll create a list (assigned to an object named example_list) that contains some of the objects we created earlier in the lesson: the arbitrary_values vector, the months_four vector, and the country_df data frame.

# creates list whose elements are the "arbitrary_values" numeric vector, the "months_four" character vector, and the "country_df" data frame, and assigns it to a new object named "example_list"
example_list<-list(arbitrary_values, months_four, country_df)

Now that we’ve created our list object, let’s print out its contents:

# prints contents of "example_list"
example_list
## [[1]]
## [1]  5.0  7.0 55.6 32.5
## 
## [[2]]
## [1] "January"  "February" "March"    "April"   
## 
## [[3]]
##     Country   GDP Population     Continent
## 1 Country A  8000       2000 South America
## 2 Country B 30000       5400        Europe
## 3 Country C 23500      10000 North America

As you can see, our list contains each of the various specified objects within a single, unified structure. We can access specific elements within a list using the specific index number of the desired element, in much the same way we did for vectors. When extracting a single list element from a list, the convention is to enclose the index number of the desired list element in double square brackets. For example, if we want to extract the country-level data frame from example_list, we can use the following:

# extracts country-level data frame from "example_list"; the country-level data frame is the third element in "example_list"
example_list[[3]]
##     Country   GDP Population     Continent
## 1 Country A  8000       2000 South America
## 2 Country B 30000       5400        Europe
## 3 Country C 23500      10000 North America

If we want to subset a list, and extract more than one list element as a separate list, we can do so by creating a vector of the index values of the desired elements, and enclosing it in single brackets after the name of the list object. For example, if we wanted to generate a new list that contained only the first and third elements of example_list (the numeric vector of arbitrary values and the data frame), we would use the following syntax:

# extracts the first and third elements of "example_list" as a new list
example_list[c(1,3)]
## [[1]]
## [1]  5.0  7.0 55.6 32.5
## 
## [[2]]
##     Country   GDP Population     Continent
## 1 Country A  8000       2000 South America
## 2 Country B 30000       5400        Europe
## 3 Country C 23500      10000 North America
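The difference between double and single brackets is easy to overlook, so a short check may help. The sketch below is our own illustration (not part of the original workshop code); it re-creates example_list so that it can be run on its own, and uses the class() and length() functions to show that double brackets return the element itself, while single brackets always return a list:

```r
# re-creates the objects stored in "example_list"
arbitrary_values <- c(5, 7, 55.6, 32.5)
months_four <- c("January", "February", "March", "April")
country_df <- data.frame(Country = c("Country A", "Country B", "Country C"),
                         GDP = c(8000, 30000, 23500),
                         Population = c(2000, 5400, 10000),
                         Continent = c("South America", "Europe", "North America"))
example_list <- list(arbitrary_values, months_four, country_df)

# double brackets return the element itself (here, a numeric vector)
class(example_list[[1]])
## [1] "numeric"
# single brackets return a list, even when only one element is selected
class(example_list[1])
## [1] "list"
# the subset created with c(1,3) is a list with two elements
length(example_list[c(1, 3)])
## [1] 2
```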

While list elements are not automatically named, we can name our list elements using the names() function. The first step is to define a character vector of desired names. We can specify any names we’d like, but for the sake of illustration, let’s say we want to name the first list element “element1”, the second list element “element2”, and the third list element “element3”. Let’s create a vector of our desired names, and assign it to an object named name_vector:

# creates a character vector of desired names for list elements, and assigns it to a new object named "name_vector"
name_vector<-c("element1", "element2", "element3")

Now, we’ll assign the names in name_vector to the list elements in example_list with the following:

# assigns names from "name_vector" to list elements in "example_list"
names(example_list)<-name_vector

Let’s now print the contents of example_list:

# prints contents of "example_list"
example_list
## $element1
## [1]  5.0  7.0 55.6 32.5
## 
## $element2
## [1] "January"  "February" "March"    "April"   
## 
## $element3
##     Country   GDP Population     Continent
## 1 Country A  8000       2000 South America
## 2 Country B 30000       5400        Europe
## 3 Country C 23500      10000 North America

Note that the list elements now have names attached to them; the first character string in name_vector is assigned as the name of the first element in example_list, the second character string in name_vector is assigned as the name of the second element in example_list, and so on.

Practically speaking, we can now extract list elements using the assigned names. For example, if we want to extract the data frame from example_list, we could do so by its assigned name (“element3”), as follows:

# Extracts the data frame from "example_list" by its assigned name
example_list[["element3"]]
##     Country   GDP Population     Continent
## 1 Country A  8000       2000 South America
## 2 Country B 30000       5400        Europe
## 3 Country C 23500      10000 North America

Note that even after assigning names to list elements, you can still extract elements by their index value, if you would prefer to do so:

# Extracts the "element3" data frame from "example_list" by its index number
example_list[[3]]
##     Country   GDP Population     Continent
## 1 Country A  8000       2000 South America
## 2 Country B 30000       5400        Europe
## 3 Country C 23500      10000 North America
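Named list elements can also be extracted with the $ operator, and lists lend themselves to the kind of iterative operations mentioned at the start of this section. The sketch below is our own illustration (not part of the original workshop code); it re-creates the named example_list so that it can be run on its own, and uses base R’s lapply() function to apply class() to each list element in turn:

```r
# re-creates the named "example_list"
example_list <- list(c(5, 7, 55.6, 32.5),
                     c("January", "February", "March", "April"),
                     data.frame(Country = c("Country A", "Country B", "Country C"),
                                GDP = c(8000, 30000, 23500),
                                Population = c(2000, 5400, 10000),
                                Continent = c("South America", "Europe", "North America")))
names(example_list) <- c("element1", "element2", "element3")

# extracts the first list element by name, using the $ operator
example_list$element1
## [1]  5.0  7.0 55.6 32.5

# applies the class() function to every element of the list, and returns the results as a new list
element_classes <- lapply(example_list, class)
element_classes
## $element1
## [1] "numeric"
## 
## $element2
## [1] "character"
## 
## $element3
## [1] "data.frame"
```

The list returned by lapply() keeps the names of the original list, which makes it easy to match each result to the element that produced it.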

3.3.4 Identifying data structures

It is useful to be able to quickly identify the data structure of a given object. Indeed, one way that things can go wrong when processing or analyzing data in R is that a given function expects a certain type of data structure as an input, but encounters something else, which will cause the function to throw an error or perform unexpectedly. In such circumstances, it is especially useful to be able to quickly double-check the data structure of a given object.

We can quickly ascertain this information by passing a given object as an argument to the class() function, which will provide information about the object’s data structure.

For example, let’s say we want to confirm that example_list is indeed a list:

# print the data structure of the "example_list" object
class(example_list)
## [1] "list"

Let’s take another example:

# print the data structure of the "months_four" object
class(months_four)
## [1] "character"

Note that we can read “character” as “character vector”.

Similarly, we can read “numeric” as “numeric vector”:

# print the data structure of the "arbitrary_values" object
class(arbitrary_values)
## [1] "numeric"
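
To see class() at work across several data structures at once, here is a minimal sketch. The object names (sample_numbers, sample_words, sample_df, sample_list) are hypothetical and are not objects defined earlier in the tutorial. The sketch also shows the is.*() helper functions, which return TRUE or FALSE and can be convenient when you want to check an input before passing it to a function.

```r
# Illustrative objects (hypothetical names and values)
sample_numbers<-c(1.5, 2.5, 3.5)
sample_words<-c("a", "b")
sample_df<-data.frame(x=1:3)
sample_list<-list(sample_numbers, sample_words)

# Collects the class of each object in a named character vector
object_classes<-c(numbers=class(sample_numbers),
                  words=class(sample_words),
                  df=class(sample_df),
                  lst=class(sample_list))
object_classes

# is.*() helpers return TRUE or FALSE for a specific data structure
is.numeric(sample_numbers)
is.data.frame(sample_df)
```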

3.4 Functions

As we mentioned earlier, a function is a programming construct that takes a set of inputs (also known as arguments), manipulates those inputs/arguments in a specific way (the body of the function), and returns an output that is the product of how those inputs are manipulated in the body of the function. It is much like a recipe, where the recipe’s ingredients are analogous to a function’s inputs, the instructions about how to combine and process those ingredients are analogous to the body of the function, and the end product of the recipe (for example, a cake) is analogous to the function’s output. R packages are essentially pre-written collections of functions organized around a given theme, and for a large number of data processing and analysis tasks, one can rely on these pre-written functions. In some cases, however, you may want to write your own functions from scratch.

Why might you want to write your own functions?

  • Sometimes, there won’t be a convenient pre-programmed function available to accomplish a given task, which will require you to write your own custom function.
  • Writing your own functions will allow you to automate your workflows.
  • Writing functions will allow you to write more concise and readable code.

Writing your own functions can be challenging, but this section will provide you with some basic intuition for how the process works. To develop this intuition, we’ll use a very simple example.

Let’s say you have a large collection of temperature data, measured in Fahrenheit, and you want to convert these data to Celsius. Recall that the formula to convert from Fahrenheit to Celsius is the following, where “C” represents temperature in Celsius, and “F” represents temperature in Fahrenheit:

# fahrenheit to Celsius formula, where "F" is fahrenheit input
C=(F-32)*(5/9)

Recall that at its most basic level, R is a calculator; if, for example, we have a Fahrenheit measurement of 55 degrees, we can convert this to Celsius by plugging 55 into the conversion formula:

# Converts 55 degrees fahrenheit to Celsius
(55-32)*(5/9)
## [1] 12.77778

This is easy enough, but if we have a large amount of temperature data that requires processing, we wouldn’t want to carry out this calculation using arithmetic operators for each measurement in our data collection; that could quickly become unwieldy and tedious. Instead of repeatedly using arithmetic operators, we can wrap the Fahrenheit-to-Celsius conversion formula into a function:

# Generates function that takes fahrenheit value ("fahrenheit_input") and returns a value in Celsius, and assigns the function to an object named "fahrenheit_to_celsius_converter"
fahrenheit_to_celsius_converter<-function(fahrenheit_input){
  celsius_output<-(fahrenheit_input-32)*(5/9)
  return(celsius_output)
}

Let’s unpack the code above, which we used to create our function:

  • We declare that we are creating a new function with the word function; within the parentheses after function, we specify the function’s argument(s). Here, the function’s argument is an input named fahrenheit_input. The name of an argument is arbitrary, and can be anything you like; ideally, it should be informed by relevant context. Here, the argument/input to the function is a temperature value expressed in degrees Fahrenheit, so the name “fahrenheit_input” describes the nature of this input.
  • After enclosing the function’s arguments within parentheses, we type an opening curly brace {, and then define the body of the function (i.e. the recipe), which specifies how we want to transform this input. In particular, we take fahrenheit_input, subtract 32, and then multiply by 5/9, which transforms the input to the Celsius temperature scale. We tell R to assign this transformed value to a new object, named celsius_output.
  • In the function’s final line, return(celsius_output), we specify the value we want the function to return. Here, we are saying that we want the function to return the value that was assigned to celsius_output. We then close the function by typing a closing curly brace } below the return statement.
  • Just as we can assign data or visualizations to objects that allow us to subsequently retrieve the outputs of our code, so too with functions. Here, we assign the function we have just written to an object named fahrenheit_to_celsius_converter.

After creating our function by running that code, we can use the newly created fahrenheit_to_celsius_converter function to perform our Fahrenheit to Celsius transformations. Let’s say we have a Fahrenheit value of 68, and want to transform it to Celsius. Instead of the following calculation:

# Uses arithmetic operation to convert 68 degrees Fahrenheit to Celsius
(68-32)*(5/9)
## [1] 20

We can use our function:

# Uses "fahrenheit_to_celsius_converter" function to convert 68 degrees Fahrenheit to Celsius
fahrenheit_to_celsius_converter(fahrenheit_input=68)
## [1] 20

Above, we passed the argument “fahrenheit_input=68” to the fahrenheit_to_celsius_converter function that we created; the function then took this value (68), plugged it into “fahrenheit_input” within the function and assigned the resulting value to “celsius_output”; it then returned the value of “celsius_output” (20) back to us.

Let’s try another one:

fahrenheit_to_celsius_converter(fahrenheit_input=22)
## [1] -5.555556

In short, we can specify any value for the “fahrenheit_input” argument; this value will be substituted for “fahrenheit_input” in the expression celsius_output<-(fahrenheit_input-32)*(5/9), after which the value of celsius_output will be returned to us.

Even though the Fahrenheit to Celsius conversion formula is not particularly complex, it is clear that writing a function to perform this calculation is nonetheless more efficient than repeatedly performing the relevant arithmetic operation. As the operations you need to perform on your data become more complex, and the number of times you need to perform those operations increases, the benefits of wrapping those operations into a function become ever-more apparent.
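
To see that the same function anatomy carries over to other formulas, here is a sketch of the reverse conversion. The function name celsius_to_fahrenheit_converter and the argument name celsius_input are made up for this illustration; the formula F=C*(9/5)+32 is simply the conversion formula used above, solved for F.

```r
# Hypothetical companion function: takes a Celsius value ("celsius_input") and returns the corresponding Fahrenheit value
celsius_to_fahrenheit_converter<-function(celsius_input){
  fahrenheit_output<-celsius_input*(9/5)+32
  return(fahrenheit_output)
}

# Converts 20 degrees Celsius to Fahrenheit
converted_value<-celsius_to_fahrenheit_converter(celsius_input=20)
converted_value
```

Since 68 degrees Fahrenheit converted to 20 degrees Celsius earlier, passing 20 to this function should give back 68.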

3.5 Iteration

Once we have a function written down, it is straightforward to apply that function to multiple inputs in an iterative fashion. For example, let’s say you have four different Fahrenheit temperature values that you would like to convert to Celsius using the fahrenheit_to_celsius_converter function we developed above: 45.6, 95.9, 67.8, and 43. One option would be to run these values through the function individually, as below:

fahrenheit_to_celsius_converter(fahrenheit_input=45.6)
## [1] 7.555556
fahrenheit_to_celsius_converter(fahrenheit_input=95.9)
## [1] 35.5
fahrenheit_to_celsius_converter(fahrenheit_input=67.8)
## [1] 19.88889
fahrenheit_to_celsius_converter(fahrenheit_input=43.)
## [1] 6.111111

This is manageable with a collection of only four Fahrenheit values, but would quickly become tedious if you had a substantially larger set of Fahrenheit temperature values that required conversion. Instead of manually applying the function to each individual input value, we can put these values into a vector, and then iteratively apply the fahrenheit_to_celsius_converter function to each of these vector elements.

Let’s first assign our Fahrenheit temperature values to a numeric vector object named fahrenheit_input_vector:

# makes a vector out of Fahrenheit values we want to convert, and assigns it to a new object named "fahrenheit_input_vector"
fahrenheit_input_vector<-c(45.6, 95.9, 67.8, 43)

Our goal is to iteratively apply our function to all of these vector elements, and deposit the transformed results into a new vector. In programming languages, functions are typically applied to multiple inputs in an iterative fashion using a construct known as a for-loop, which some of you may already be familiar with. R users also frequently use specialized functions (instead of for-loops) to iterate over elements; this is often faster, or at the very least, makes R scripts more readable. One family of these iterative functions is the “apply” family of functions. A more recent set of functions that facilitate iteration is part of the tidyverse, and is found within the purrr package. These functions are known as map() functions, and we will use them here to iteratively apply our functions to multiple inputs.
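
For comparison, here is a minimal base R sketch of the two alternatives mentioned above: a for-loop, and vapply() from the apply family. The function and input vector are re-created so that the example runs on its own; the object names celsius_from_loop and celsius_from_vapply are hypothetical.

```r
# Re-creates the converter function and input vector so the example is self-contained
fahrenheit_to_celsius_converter<-function(fahrenheit_input){
  celsius_output<-(fahrenheit_input-32)*(5/9)
  return(celsius_output)
}
fahrenheit_input_vector<-c(45.6, 95.9, 67.8, 43)

# For-loop approach: creates an empty numeric vector of the right length, then fills it one element at a time
celsius_from_loop<-numeric(length(fahrenheit_input_vector))
for (i in seq_along(fahrenheit_input_vector)){
  celsius_from_loop[i]<-fahrenheit_to_celsius_converter(fahrenheit_input_vector[i])
}

# apply-family approach: vapply() applies the function to each element and checks that each result is a single number
celsius_from_vapply<-vapply(fahrenheit_input_vector, fahrenheit_to_celsius_converter, numeric(1))

# Checks that both approaches produce the same values
all.equal(celsius_from_loop, celsius_from_vapply)
```

The for-loop needs explicit bookkeeping (creating the output vector and tracking the index i), which is part of why iteration functions tend to read more cleanly.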

Let’s see how we can use a map() function to sequentially apply the fahrenheit_to_celsius_converter() function we created to several different values for the “fahrenheit_input” argument, contained in fahrenheit_input_vector. We’ll pass fahrenheit_input_vector as the first argument to the map_dbl() function, and fahrenheit_to_celsius_converter (i.e. the function we want to apply iteratively to the elements in fahrenheit_input_vector) as the second argument. The result of this operation will be a new “results vector”, containing the transformed temperature values for each input in the original vector of Fahrenheit values (fahrenheit_input_vector). We’ll assign this result/output vector to a new object named celsius_outputs_vector:

# Iteratively applies the "fahrenheit_to_celsius_converter" function to the Fahrenheit input values in "fahrenheit_input_vector" and assigns the resulting vector of converted temperature values to "celsius_outputs_vector"
celsius_outputs_vector<-map_dbl(fahrenheit_input_vector, fahrenheit_to_celsius_converter)

In short, the code above takes fahrenheit_input_vector (i.e. a vector with the numbers 45.6, 95.9, 67.8, 43), runs each of these numbers through the fahrenheit_to_celsius_converter() function, and sequentially deposits the transformed results into the newly created celsius_outputs_vector object, which contains the following elements:

# prints contents of "celsius_outputs_vector"
celsius_outputs_vector
## [1]  7.555556 35.500000 19.888889  6.111111

More explicitly, the code that reads celsius_outputs_vector<-map_dbl(fahrenheit_input_vector, fahrenheit_to_celsius_converter) did the following:

  1. Pass 45.6 (the first element in the input vector, fahrenheit_input_vector) to the fahrenheit_to_celsius_converter() function, and place the output (7.555556) as the first element in a new vector of transformed values, named celsius_outputs_vector.
  2. Pass 95.9 (the second element in the input vector, fahrenheit_input_vector) to the fahrenheit_to_celsius_converter() function, and deposit the output (35.500000) as the second element in celsius_outputs_vector.
  3. Pass 67.8 (the third element in the input vector, fahrenheit_input_vector) to the fahrenheit_to_celsius_converter() function, and deposit the output (19.888889) as the third element in celsius_outputs_vector.
  4. Pass 43 (the fourth element in the input vector, fahrenheit_input_vector) to the fahrenheit_to_celsius_converter() function, and deposit the output (6.111111) as the fourth element in celsius_outputs_vector.

There are a variety of map() functions in the purrr package, and the precise one you should use depends on the number of arguments used by the function (in this example, there is of course only one argument, i.e. “fahrenheit_input”), and the desired class of the output (i.e. numeric vector, character vector, data frame, list etc.). For example, let’s say we want to apply the fahrenheit_to_celsius_converter function iteratively to the input values in fahrenheit_input_vector, but that we want the output values to be stored as a list, rather than as a vector. Instead of using the map_dbl() function, we can use the map() function, which always returns outputs as a list. Below, we pass our input vector (fahrenheit_input_vector), and the function we want to iteratively apply to the elements of the input vector (fahrenheit_to_celsius_converter) to the map() function. We’ll assign the output list to a new object named celsius_outputs_list:

# iteratively applies the "fahrenheit_to_celsius_converter" function to the input values in "fahrenheit_input_vector", and assigns the list of celsius output values to a new object named "celsius_outputs_list"
celsius_outputs_list<-map(fahrenheit_input_vector, fahrenheit_to_celsius_converter)

Let’s print out the list of output values:

# prints contents of "celsius_outputs_list"
celsius_outputs_list
## [[1]]
## [1] 7.555556
## 
## [[2]]
## [1] 35.5
## 
## [[3]]
## [1] 19.88889
## 
## [[4]]
## [1] 6.111111

We can confirm that celsius_outputs_list is indeed a list using the class() function that we introduced earlier:

# checks data structure of "celsius_outputs_list"
class(celsius_outputs_list)
## [1] "list"
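
Because a list and a vector are handled differently, it can help to know how to move between the two. The sketch below uses a made-up list of single numbers (sample_output_list is hypothetical, shaped like the output of map()) to show how to pull out one element with double brackets, and how to flatten the list into a numeric vector with the base R function unlist().

```r
# Hypothetical list of single numbers, shaped like the output of map()
sample_output_list<-list(7.5, 35.5, 19.9, 6.1)

# Extracts the second element of the list
second_value<-sample_output_list[[2]]
second_value

# Flattens the list into a numeric vector
sample_output_vector<-unlist(sample_output_list)

# Compares the classes of the list and the flattened vector
class(sample_output_list)
class(sample_output_vector)
```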

Now, let’s say we want to organize our information in a data frame, where one column represents our Fahrenheit input values, and the other column represents the corresponding Celsius output values. To do so, we’ll first slightly modify our function to return a data frame:

# Creates function that takes an input value in degrees Fahrenheit (fahrenheit_input), converts this value to Celsius, and returns a data frame with the input Fahrenheit temperature value as one column, and the corresponding Celsius temperature value as another column; the function is assigned to a new object named "fahrenheit_to_celsius_converter_df" 
fahrenheit_to_celsius_converter_df<-function(fahrenheit_input){
  celsius_output<-(fahrenheit_input-32)*(5/9)
  celsius_output_df<-data.frame(fahrenheit_input, celsius_output)
  return(celsius_output_df)
}

Now, let’s test out this new function for a single “fahrenheit_input” value, to make sure it works as expected; we’ll test it out for a value of 63 degrees Fahrenheit:

# applies "fahrenheit_to_celsius_converter_df" function to input value of 63 degrees Fahrenheit
fahrenheit_to_celsius_converter_df(fahrenheit_input=63)
##   fahrenheit_input celsius_output
## 1               63       17.22222

Having confirmed that the function works as expected, let’s now assemble a dataset using multiple Fahrenheit input values, where one column consists of these input values, and the second column consists of the corresponding Celsius outputs. We can do so using the map_dfr() function from the purrr package, which is a cousin of the map() and map_dbl() functions we explored above. While the map() function returns function outputs in a list, and the map_dbl() function returns function outputs in a numeric vector, the map_dfr() function is used to bind together multiple function outputs row-wise into a data frame. To make this more concrete, let’s consider the code below, which uses map_dfr() to iteratively apply the fahrenheit_to_celsius_converter_df function to the Fahrenheit values in fahrenheit_input_vector, and assemble the resulting rows into a data frame that is assigned to a new object named celsius_outputs_df:

# Iteratively applies the "fahrenheit_to_celsius_converter_df" function to input values in "fahrenheit_input_vector" to generate a data frame with column of input Fahrenheit values, and column of corresponding output Celsius values; assigns this data frame to a new object named "celsius_outputs_df"
celsius_outputs_df<-map_dfr(fahrenheit_input_vector, fahrenheit_to_celsius_converter_df)

Let’s now print the contents of celsius_outputs_df:

# prints contents of "celsius_outputs_df"
celsius_outputs_df
##   fahrenheit_input celsius_output
## 1             45.6       7.555556
## 2             95.9      35.500000
## 3             67.8      19.888889
## 4             43.0       6.111111

We now have a dataset with one column consisting of our Fahrenheit inputs (taken from fahrenheit_input_vector), and a second column consisting of our Celsius outputs (derived by applying the fahrenheit_to_celsius_converter_df() function to our vector of input values, fahrenheit_input_vector).
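
One way to picture what map_dfr() does is as two steps: produce a list of one-row data frames, then stack them row-wise. The sketch below spells those two steps out in base R using lapply() and rbind(); it is an illustration of the idea, not a description of how purrr implements it, and the object names list_of_rows and stacked_df are hypothetical. (Recent versions of purrr mark map_dfr() as superseded and suggest map() followed by list_rbind() instead; check the documentation for your installed version.)

```r
# Re-creates the data-frame-returning converter and the input vector so the example is self-contained
fahrenheit_to_celsius_converter_df<-function(fahrenheit_input){
  celsius_output<-(fahrenheit_input-32)*(5/9)
  celsius_output_df<-data.frame(fahrenheit_input, celsius_output)
  return(celsius_output_df)
}
fahrenheit_input_vector<-c(45.6, 95.9, 67.8, 43)

# Step 1: applies the function to each input, producing a list of one-row data frames
list_of_rows<-lapply(fahrenheit_input_vector, fahrenheit_to_celsius_converter_df)

# Step 2: stacks the one-row data frames on top of one another
stacked_df<-do.call(rbind, list_of_rows)
stacked_df
```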

We’ve just covered three different purrr functions: map() (which returns a list), map_dbl() (which returns a numeric vector), and map_dfr() (which returns a data frame). There are other map functions which return different types of objects; you can see a list of these other map functions by inspecting the documentation for the map() function:

?map

The process of iteratively applying a function with more than one argument is beyond the scope of the workshop, but the same general principles are at work in those cases. If you’d like to explore the process of iteratively applying a function with two arguments, or more than two arguments, check out the documentation for the map2() and pmap() functions, respectively.
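
For a brief taste of the two-argument case, here is a minimal sketch using map2_dbl() from purrr, which walks over two input vectors in parallel. The function fahrenheit_to_celsius_rounded and its digits argument are hypothetical, invented for this illustration: the function converts a Fahrenheit value to Celsius and rounds the result to a chosen number of decimal places.

```r
library(purrr)

# Hypothetical two-argument function: converts Fahrenheit to Celsius and rounds the result to "digits" decimal places
fahrenheit_to_celsius_rounded<-function(fahrenheit_input, digits){
  celsius_output<-round((fahrenheit_input-32)*(5/9), digits)
  return(celsius_output)
}

# Two input vectors of equal length: temperatures, and the number of decimal places to keep for each
fahrenheit_values<-c(45.6, 95.9, 67.8, 43)
rounding_digits<-c(0, 1, 2, 3)

# Pairs the first temperature with the first digits value, the second with the second, and so on
rounded_celsius<-map2_dbl(fahrenheit_values, rounding_digits, fahrenheit_to_celsius_rounded)
rounded_celsius
```

The first input vector supplies the first argument of the function and the second input vector supplies the second argument, one pair at a time.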

Before we move into the next section, let’s consider one more example of how you can use your own custom-written functions in conjunction with the iteration functions in the purrr package to write scripts that can help you to automate tedious tasks. In particular, we’ll demonstrate the utility of the list data structure in helping you to carry out such automation tasks.

Let’s say, for example, that you have temperature values stored in Fahrenheit, for multiple countries, and want to quickly convert those country-level values to degrees Celsius. Suppose that these Fahrenheit values are stored in a series of vectors:

# creates sample country-level Fahrenheit data for Country A
countryA_fahrenheit<-c(55,67,91,23, 77, 98, 27)

# creates sample country-level Fahrenheit data for Country B
countryB_fahrenheit<-c(33,45,11,66, 44)

# creates sample country-level Fahrenheit data for Country C
countryC_fahrenheit<-c(60,55,12,109)

# creates sample country-level Fahrenheit data for Country D
countryD_fahrenheit<-c(76, 24, 77, 78)

Let’s say that we want to take all of these vectors, and iteratively pass them as arguments to the fahrenheit_to_celsius_converter_df function, thereby creating four country-specific data frames that have the original Fahrenheit values in one column and the transformed Celsius values in the other column. The easiest way to do this is to first put our input vectors into a list, which we’ll assign to a new object named temperature_input_list:

# Creates list of input vectors and assigns this list to new object named "temperature_input_list"
temperature_input_list<-list(countryA_fahrenheit, countryB_fahrenheit, countryC_fahrenheit, countryD_fahrenheit) 

Now, we’ll use the map() function to iteratively pass the vectors in temperature_input_list as arguments to the fahrenheit_to_celsius_converter_df function, and deposit the resulting data frames into a list; we’ll assign this list that contains the output data frames to a new list object, named processed_temperature_data_list:

# Iteratively passes vectors in "temperature_input_list" as arguments to "fahrenheit_to_celsius_converter_df" and deposits the resulting data frames to a list, which is assigned to a new object named "processed_temperature_data_list"
processed_temperature_data_list<-map(temperature_input_list, fahrenheit_to_celsius_converter_df)

In effect, the code above takes the countryA_fahrenheit vector, uses it as the argument to the fahrenheit_to_celsius_converter_df function, and deposits the resulting data frame as the first element in the processed_temperature_data_list list; it then takes the countryB_fahrenheit vector, uses it as the argument to the fahrenheit_to_celsius_converter_df function, and deposits the resulting data frame as the second element in the processed_temperature_data_list list; and so on.
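
The reason a whole vector can be used as the argument here is that arithmetic in R is vectorized: subtracting 32 and multiplying by 5/9 is carried out on every element of the vector at once, and data.frame() then pairs each input with its output. A minimal sketch with a made-up input vector (sample_fahrenheit is hypothetical):

```r
# Re-creates the data-frame-returning converter so the example is self-contained
fahrenheit_to_celsius_converter_df<-function(fahrenheit_input){
  celsius_output<-(fahrenheit_input-32)*(5/9)
  celsius_output_df<-data.frame(fahrenheit_input, celsius_output)
  return(celsius_output_df)
}

# Hypothetical vector of three Fahrenheit values
sample_fahrenheit<-c(32, 50, 212)

# Passes the entire vector in a single call; the result has one row per input value
sample_df<-fahrenheit_to_celsius_converter_df(sample_fahrenheit)
sample_df
```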

Let’s print the contents of processed_temperature_data_list and confirm that our data frames have been created as expected:

# prints contents of "processed_temperature_data_list"
processed_temperature_data_list
## [[1]]
##   fahrenheit_input celsius_output
## 1               55      12.777778
## 2               67      19.444444
## 3               91      32.777778
## 4               23      -5.000000
## 5               77      25.000000
## 6               98      36.666667
## 7               27      -2.777778
## 
## [[2]]
##   fahrenheit_input celsius_output
## 1               33      0.5555556
## 2               45      7.2222222
## 3               11    -11.6666667
## 4               66     18.8888889
## 5               44      6.6666667
## 
## [[3]]
##   fahrenheit_input celsius_output
## 1               60       15.55556
## 2               55       12.77778
## 3               12      -11.11111
## 4              109       42.77778
## 
## [[4]]
##   fahrenheit_input celsius_output
## 1               76      24.444444
## 2               24      -4.444444
## 3               77      25.000000
## 4               78      25.555556

As an exercise, try and extract a given dataset from processed_temperature_data_list using the indexing method we discussed above. Additionally, see if you can assign names to the list elements in processed_temperature_data_list.
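
If you get stuck, the sketch below shows one possible approach on a made-up miniature list (mini_data_list and the names “countryA” and “countryB” are hypothetical); the same steps should carry over to processed_temperature_data_list.

```r
# Hypothetical miniature version of "processed_temperature_data_list" with two data frames
mini_data_list<-list(data.frame(fahrenheit_input=c(32, 50), celsius_output=c(0, 10)),
                     data.frame(fahrenheit_input=212, celsius_output=100))

# Extracts the second data frame by its index number
second_df<-mini_data_list[[2]]

# Assigns names to the list elements
names(mini_data_list)<-c("countryA", "countryB")

# Extracts the same data frame by its assigned name
countryB_df<-mini_data_list[["countryB"]]

# Checks that both approaches return the same data frame
identical(second_df, countryB_df)
```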

4 Part 2: Applied Data Work in R

The material in Part 1 was not intended as a comprehensive introduction to the R programming language. Its goal, rather, was to present some ideas, concepts, and tools that can serve as a general foundation for working with data in R. Now that we have this basic foundation, we’ll turn in this section to a more applied exploration of some actual datasets. Our goal here is to introduce you to some useful functions that will allow you to explore and begin making sense of actual datasets in R.

4.1 Data Transfer Part 1: Reading in Data

Typically, the first step when working with research data in RStudio is to load your relevant data into memory. There are many ways to do this, and the precise way in which you will do so will depend on where your data is stored, and how it is structured. Below, we’ll cover the process of reading your data into RStudio under a few different scenarios.

4.1.1 Reading in a dataset from a directory on your computer

Often (especially when a dataset is of tractable size), you will have the dataset you would like to analyze stored in a directory on your computer. To read in such a dataset, we can pass its file path to the read_csv() function:

# Reads in Persson/Tabellini Data from local directory
pt<-read_csv("data/pt/persson_tabellini_workshop.csv")
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   .default = col_double(),
##   country = col_character(),
##   continent = col_character()
## )
## ℹ Use `spec()` for the full column specifications.
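
If you would like to try this pattern without the workshop data, the sketch below writes a tiny made-up dataset to a temporary file and reads it back in. It uses base R’s write.csv() and read.csv() so that it runs without any extra packages; for a simple case like this, read_csv() is called in the same way, with the file path as its first argument. The object names and data values are hypothetical.

```r
# Creates a small illustrative data frame (hypothetical data)
sample_data<-data.frame(country=c("Country A", "Country B"), gdp=c(8000, 30000))

# Builds a path to a temporary file; file.path() joins directory and file names with the correct separator
sample_path<-file.path(tempdir(), "sample_data.csv")

# Writes the data frame to disk, then confirms that the file exists before reading it
write.csv(sample_data, sample_path, row.names=FALSE)
file.exists(sample_path)

# Reads the file back in from the local directory
sample_data_read<-read.csv(sample_path)
sample_data_read
```

Checking file.exists() first is a quick way to tell a mistyped path apart from a problem with the file’s contents.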

4.1.2 Reading in multiple datasets from your disk

# Lists the names of the files in the "data/wos" directory and assigns them to a new object named "wos_files"
wos_files<-list.files("data/wos")
# prints contents of "wos_files"
wos_files
##  [1] "ClimateAndArt_01.csv" "ClimateAndArt_02.csv" "ClimateAndArt_03.csv"
##  [4] "ClimateAndArt_04.csv" "ClimateAndArt_05.csv" "ClimateAndArt_06.csv"
##  [7] "ClimateAndArt_07.csv" "ClimateAndArt_08.csv" "ClimateAndArt_09.csv"
## [10] "ClimateAndArt_10.csv" "ClimateAndArt_11.csv" "ClimateAndArt_12.csv"
## [13] "ClimateAndArt_13.csv"
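# Iteratively reads in all individual WOS files from the "data/wos" directory and assigns the resulting list of data frames to an object named "wos_file_list"; file.path() prepends the directory to each file name, so the working directory does not need to be changed with setwd()
wos_file_list<-map(file.path("data/wos", wos_files), read_csv)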
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   .default = col_character(),
##   `Book Series Subtitle` = col_logical(),
##   `Cited References` = col_logical(),
##   `Cited Reference Count` = col_double(),
##   `Times Cited, WoS Core` = col_double(),
##   `Times Cited, All Databases` = col_double(),
##   `180 Day Usage Count` = col_double(),
##   `Since 2013 Usage Count` = col_double(),
##   `Publication Year` = col_double(),
##   `Meeting Abstract` = col_logical(),
##   `Number of Pages` = col_double(),
##   `Pubmed Id` = col_double()
## )
## ℹ Use `spec()` for the full column specifications.
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   .default = col_character(),
##   `Book Series Subtitle` = col_logical(),
##   `Cited References` = col_logical(),
##   `Cited Reference Count` = col_double(),
##   `Times Cited, WoS Core` = col_double(),
##   `Times Cited, All Databases` = col_double(),
##   `180 Day Usage Count` = col_double(),
##   `Since 2013 Usage Count` = col_double(),
##   `Publication Year` = col_double(),
##   `Number of Pages` = col_double(),
##   `Pubmed Id` = col_double()
## )
## ℹ Use `spec()` for the full column specifications.
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   .default = col_character(),
##   `Book Series Subtitle` = col_logical(),
##   `Cited References` = col_logical(),
##   `Cited Reference Count` = col_double(),
##   `Times Cited, WoS Core` = col_double(),
##   `Times Cited, All Databases` = col_double(),
##   `180 Day Usage Count` = col_double(),
##   `Since 2013 Usage Count` = col_double(),
##   `Publication Year` = col_double(),
##   `Meeting Abstract` = col_logical(),
##   `Number of Pages` = col_double(),
##   `Pubmed Id` = col_double()
## )
## ℹ Use `spec()` for the full column specifications.
## 
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   .default = col_character(),
##   `Book Series Subtitle` = col_logical(),
##   `Cited References` = col_logical(),
##   `Cited Reference Count` = col_double(),
##   `Times Cited, WoS Core` = col_double(),
##   `Times Cited, All Databases` = col_double(),
##   `180 Day Usage Count` = col_double(),
##   `Since 2013 Usage Count` = col_double(),
##   `Publication Year` = col_double(),
##   `Meeting Abstract` = col_logical(),
##   `Number of Pages` = col_double(),
##   `Pubmed Id` = col_double()
## )
## ℹ Use `spec()` for the full column specifications.
## 
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   .default = col_character(),
##   `Book Series Subtitle` = col_logical(),
##   `Cited References` = col_logical(),
##   `Cited Reference Count` = col_double(),
##   `Times Cited, WoS Core` = col_double(),
##   `Times Cited, All Databases` = col_double(),
##   `180 Day Usage Count` = col_double(),
##   `Since 2013 Usage Count` = col_double(),
##   `Publication Year` = col_double(),
##   `Meeting Abstract` = col_logical(),
##   `Number of Pages` = col_double(),
##   `Pubmed Id` = col_double()
## )
## ℹ Use `spec()` for the full column specifications.
## 
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   .default = col_character(),
##   `Book Series Subtitle` = col_logical(),
##   `Cited References` = col_logical(),
##   `Cited Reference Count` = col_double(),
##   `Times Cited, WoS Core` = col_double(),
##   `Times Cited, All Databases` = col_double(),
##   `180 Day Usage Count` = col_double(),
##   `Since 2013 Usage Count` = col_double(),
##   `Publication Year` = col_double(),
##   `Meeting Abstract` = col_logical(),
##   `Number of Pages` = col_double(),
##   `Pubmed Id` = col_double()
## )
## ℹ Use `spec()` for the full column specifications.
## 
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   .default = col_character(),
##   `Book Series Subtitle` = col_logical(),
##   `Cited References` = col_logical(),
##   `Cited Reference Count` = col_double(),
##   `Times Cited, WoS Core` = col_double(),
##   `Times Cited, All Databases` = col_double(),
##   `180 Day Usage Count` = col_double(),
##   `Since 2013 Usage Count` = col_double(),
##   `Publication Year` = col_double(),
##   `Meeting Abstract` = col_logical(),
##   `Number of Pages` = col_double(),
##   `Pubmed Id` = col_double()
## )
## ℹ Use `spec()` for the full column specifications.
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   .default = col_character(),
##   `Group Authors` = col_logical(),
##   `Book Series Subtitle` = col_logical(),
##   `Cited References` = col_logical(),
##   `Cited Reference Count` = col_double(),
##   `Times Cited, WoS Core` = col_double(),
##   `Times Cited, All Databases` = col_double(),
##   `180 Day Usage Count` = col_double(),
##   `Since 2013 Usage Count` = col_double(),
##   `Publication Year` = col_double(),
##   `Meeting Abstract` = col_logical(),
##   `Number of Pages` = col_double(),
##   `Pubmed Id` = col_double()
## )
## ℹ Use `spec()` for the full column specifications.
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   .default = col_character(),
##   `Book Series Subtitle` = col_logical(),
##   `Cited References` = col_logical(),
##   `Cited Reference Count` = col_double(),
##   `Times Cited, WoS Core` = col_double(),
##   `Times Cited, All Databases` = col_double(),
##   `180 Day Usage Count` = col_double(),
##   `Since 2013 Usage Count` = col_double(),
##   `Publication Year` = col_double(),
##   `Meeting Abstract` = col_logical(),
##   `Number of Pages` = col_double(),
##   `Pubmed Id` = col_double()
## )
## ℹ Use `spec()` for the full column specifications.
## 
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   .default = col_character(),
##   `Book Series Subtitle` = col_logical(),
##   `Cited References` = col_logical(),
##   `Cited Reference Count` = col_double(),
##   `Times Cited, WoS Core` = col_double(),
##   `Times Cited, All Databases` = col_double(),
##   `180 Day Usage Count` = col_double(),
##   `Since 2013 Usage Count` = col_double(),
##   `Publication Year` = col_double(),
##   `Meeting Abstract` = col_logical(),
##   `Number of Pages` = col_double(),
##   `Pubmed Id` = col_double()
## )
## ℹ Use `spec()` for the full column specifications.
## 
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   .default = col_character(),
##   `Book Series Subtitle` = col_logical(),
##   `Cited References` = col_logical(),
##   `Cited Reference Count` = col_double(),
##   `Times Cited, WoS Core` = col_double(),
##   `Times Cited, All Databases` = col_double(),
##   `180 Day Usage Count` = col_double(),
##   `Since 2013 Usage Count` = col_double(),
##   `Publication Year` = col_double(),
##   `Meeting Abstract` = col_logical(),
##   `Number of Pages` = col_double(),
##   `Pubmed Id` = col_double()
## )
## ℹ Use `spec()` for the full column specifications.
## 
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   .default = col_character(),
##   `Book Series Subtitle` = col_logical(),
##   `Cited References` = col_logical(),
##   `Cited Reference Count` = col_double(),
##   `Times Cited, WoS Core` = col_double(),
##   `Times Cited, All Databases` = col_double(),
##   `180 Day Usage Count` = col_double(),
##   `Since 2013 Usage Count` = col_double(),
##   `Publication Year` = col_double(),
##   `Meeting Abstract` = col_logical(),
##   `Number of Pages` = col_double(),
##   `Pubmed Id` = col_double()
## )
## ℹ Use `spec()` for the full column specifications.
## 
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   .default = col_character(),
##   `Book Series Subtitle` = col_logical(),
##   `Cited References` = col_logical(),
##   `Cited Reference Count` = col_double(),
##   `Times Cited, WoS Core` = col_double(),
##   `Times Cited, All Databases` = col_double(),
##   `180 Day Usage Count` = col_double(),
##   `Since 2013 Usage Count` = col_double(),
##   `Publication Year` = col_double(),
##   `Meeting Abstract` = col_logical(),
##   `Number of Pages` = col_double(),
##   `Pubmed Id` = col_double()
## )
## ℹ Use `spec()` for the full column specifications.
# appends data frames in "wos_file_list" into one data frame and assigns it to a new object named "ws_df_appended"
ws_df_appended<-bind_rows(wos_file_list)
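
The same list-read-append workflow can be tried on made-up files. The sketch below is a self-contained base R version: it creates three small CSV files in a temporary directory, lists them with full.names=TRUE so that each file name already includes its directory, reads them into a list with lapply(), and stacks them with rbind(). All object names, file names, and data values are hypothetical; lapply() and rbind() stand in for map() and bind_rows() so that the example runs without extra packages.

```r
# Creates a temporary directory holding three small CSV files (hypothetical data)
demo_dir<-file.path(tempdir(), "demo_csv_files")
dir.create(demo_dir, showWarnings=FALSE)
for (i in 1:3){
  write.csv(data.frame(file_id=i, value=c(i*10, i*10+1)),
            file.path(demo_dir, paste0("demo_0", i, ".csv")),
            row.names=FALSE)
}

# Lists the CSV files with their full paths, so the working directory does not need to change
demo_files<-list.files(demo_dir, pattern="\\.csv$", full.names=TRUE)

# Reads each file into a list of data frames
demo_file_list<-lapply(demo_files, read.csv)

# Stacks the data frames row-wise into a single data frame
demo_df_appended<-do.call(rbind, demo_file_list)
demo_df_appended
```

One difference worth knowing: rbind() expects every data frame to have the same columns, whereas bind_rows() fills in NA for columns that are missing from some of the inputs.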

4.1.3 Reading in a Dataset from Cloud Storage

# Reads in PT dataset from dropbox and assigns it to a new object named "pt_cloud"
pt_cloud<-read_csv("https://www.dropbox.com/s/iczslf52s8bzku2/persson_tabellini_workshop.csv?dl=1")
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   .default = col_double(),
##   country = col_character(),
##   continent = col_character()
## )
## ℹ Use `spec()` for the full column specifications.

4.1.4 Reading in Data from an R Package

WOSR
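As a general illustration of the pattern (not of the WOSR workflow itself), many R packages bundle datasets that can be accessed by name once the package is installed. The sketch below assumes the "datasets" package that ships with R, and uses its "mtcars" data frame as a stand-in; the object name "cars_from_package" is hypothetical.

```r
# Reads in the "mtcars" dataset bundled with the "datasets" package (which ships with R)
# and assigns it to a new object named "cars_from_package" (a hypothetical name)
cars_from_package<-datasets::mtcars
# Prints the first six rows of "cars_from_package"
head(cars_from_package)
```

The "package::object" syntax retrieves the dataset without having to load the whole package with library().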

4.1.5 Extracting Data from the Web

# Reads in published dataset from CU Scholar and assigns it to a new object named "green_space_CUScholar"
green_space_CUScholar<-read_csv("https://scholar.colorado.edu/downloads/76537257b.csv")
## Warning: Missing column names filled in: 'X1' [1]
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   .default = col_double(),
##   time_period = col_character(),
##   age_group = col_character(),
##   sex = col_character(),
##   ethnicity = col_character(),
##   race2 = col_character(),
##   cohabit = col_character(),
##   insured = col_character(),
##   looking_for_work = col_character(),
##   BAplus = col_character(),
##   income = col_character(),
##   Q33_1 = col_character(),
##   Q33_2 = col_character(),
##   Q33_3 = col_character(),
##   Q33_4 = col_character(),
##   Q33_5 = col_character(),
##   diagnosed = col_character()
## )
## ℹ Use `spec()` for the full column specifications.

Landing Page Download Link

4.2 Numeric Data Processing, Manipulation, and Visualization

4.2.1 Make a copy of the dataset

# Make a copy of the dataset so we don't alter the original dataset; then, view
# the copied dataset 
pt_copy<-pt
# Print contents of "pt_copy"
pt_copy
## # A tibble: 85 × 75
##     oecd country     pind pindo ctrycd col_uk t_indep col_uka col_espa col_otha
##    <dbl> <chr>      <dbl> <dbl>  <dbl>  <dbl>   <dbl>   <dbl>    <dbl>    <dbl>
##  1     0 Argentina  0     0        213      0     183   0        0.268    0    
##  2     1 Australia  1     1        193      1      98   0.608    0        0    
##  3     1 Austria    0     0        122      0     250   0        0        0    
##  4     0 Bahamas    1     1        313      1      26   0.896    0        0    
##  5     0 Bangladesh 1     1        513      0      28   0        0        0.888
##  6     0 Barbados   1     1        316      1      33   0.868    0        0    
##  7     0 Belarus    1     1        913      0       8   0        0        0.968
##  8     1 Belgium    0     0        124      0     169   0        0        0.324
##  9     0 Belize     1     1        339      1      18   0.928    0        0    
## 10     0 Bolivia    0.116 0.116    218      0     174   0        0.304    0    
## # … with 75 more rows, and 65 more variables: legor_uk <dbl>, legor_so <dbl>,
## #   legor_fr <dbl>, legor_ge <dbl>, legor_sc <dbl>, prot80 <dbl>,
## #   catho80 <dbl>, confu <dbl>, avelf <dbl>, govef <dbl>, graft <dbl>,
## #   logyl <dbl>, loga <dbl>, yrsopen <dbl>, gadp <dbl>, engfrac <dbl>,
## #   eurfrac <dbl>, frankrom <dbl>, latitude <dbl>, gastil <dbl>, cgexp <dbl>,
## #   cgrev <dbl>, ssw <dbl>, rgdph <dbl>, trade <dbl>, prop1564 <dbl>,
## #   prop65 <dbl>, federal <dbl>, eduger <dbl>, spropn <dbl>, yearele <dbl>, …
# View contents of "pt_copy" in data viewer
View(pt_copy)

4.2.2 Summary Statistics

# Generate summary statistics for "pt_copy" and assign to new object named "pt_copy_summarystats1"
pt_copy_summarystats1<-describe(pt_copy)
# View contents of "pt_copy_summarystats1" in data viewer
View(pt_copy_summarystats1)
# Creates summary statistics for each continent grouping, and puts results in list named "summary_stats_by_continent"
summary_stats_by_continent<-describeBy(pt_copy, pt_copy$continent)
# Accessing continent-level summary statistics for africa from the "summary_stats_by_continent" list
summary_stats_by_continent[["africa"]]
##            vars  n    mean      sd  median trimmed    mad     min     max
## oecd          1 11    0.00    0.00    0.00    0.00   0.00    0.00    0.00
## country*      2 11    6.00    3.32    6.00    6.00   4.45    1.00   11.00
## pind          3 11    0.77    0.42    1.00    0.83   0.00    0.00    1.00
## pindo         4 11    0.77    0.42    1.00    0.83   0.00    0.00    1.00
## ctrycd        5 11  647.55  154.90  684.00  685.56  56.34  199.00  754.00
## col_uk        6 11    0.82    0.40    1.00    0.89   0.00    0.00    1.00
## t_indep       7 11   36.64   19.77   35.00   33.89   5.93    9.00   89.00
## col_uka       8 11    0.69    0.35    0.86    0.74   0.02    0.00    0.92
## col_espa      9 11    0.00    0.00    0.00    0.00   0.00    0.00    0.00
## col_otha     10 11    0.15    0.33    0.00    0.07   0.00    0.00    0.96
## legor_uk     11 11    0.82    0.40    1.00    0.89   0.00    0.00    1.00
## legor_so     12 11    0.00    0.00    0.00    0.00   0.00    0.00    0.00
## legor_fr     13 11    0.18    0.40    0.00    0.11   0.00    0.00    1.00
## legor_ge     14 11    0.00    0.00    0.00    0.00   0.00    0.00    0.00
## legor_sc     15 11    0.00    0.00    0.00    0.00   0.00    0.00    0.00
## prot80       16 11   22.17   20.23   25.80   19.96  19.57    0.10   64.20
## catho80      17 11   19.46   13.67   18.70   18.07  13.20    1.90   49.60
## confu        18 11    0.00    0.00    0.00    0.00   0.00    0.00    0.00
## avelf        19 11    0.71    0.14    0.73    0.73   0.15    0.38    0.84
## govef        20 11    5.37    0.82    5.02    5.25   0.68    4.56    7.26
## graft        21 11    5.11    0.77    5.39    5.12   0.80    3.93    6.23
## logyl        22 11    7.93    0.78    7.75    7.90   0.53    6.95    9.13
## loga         23 11    7.38    0.66    7.33    7.37   0.55    6.28    8.58
## yrsopen      24 11    0.21    0.29    0.16    0.15   0.18    0.00    1.00
## gadp         25 11    0.55    0.12    0.54    0.55   0.12    0.37    0.74
## engfrac      26 11    0.02    0.04    0.00    0.02   0.00    0.00    0.09
## eurfrac      27 11    0.07    0.17    0.00    0.03   0.00    0.00    0.57
## frankrom     28 11    2.90    0.51    2.94    2.86   0.56    2.19    3.95
## latitude     29 11   -9.14   15.17  -15.81   -9.58   8.49  -29.13   14.77
## gastil       30 11    3.59    1.16    4.00    3.66   1.32    1.61    4.89
## cgexp        31 10   27.00    7.63   25.50   27.10   8.58   14.65   38.57
## cgrev        32  9   26.15   10.36   23.81   26.15   6.14   17.24   50.85
## ssw          33  6    1.67    1.46    0.94    1.67   0.58    0.44    3.80
## rgdph        34 11 1899.87 1832.60 1116.28 1522.39 738.30  530.22 6666.77
## trade        35 11   77.34   32.13   69.17   76.87  27.13   30.83  128.12
## prop1564     36 11   54.23    4.91   53.23   53.51   2.96   49.05   65.95
## prop65       37 11    3.28    1.16    2.80    3.06   0.65    2.34    6.26
## federal      38 11    0.00    0.00    0.00    0.00   0.00    0.00    0.00
## eduger       39 11   73.95   23.54   73.55   73.64  25.47   40.05  110.67
## spropn       40 10    0.27    0.42    0.00    0.21   0.00    0.00    1.00
## yearele      41  8 1982.50   13.48 1990.50 1982.50   5.19 1965.00 1994.00
## yearreg      42  8 1982.50   13.48 1990.50 1982.50   5.19 1965.00 1994.00
## seats        43 11  151.20  109.96  122.22  136.21  86.65   37.33  400.00
## maj          44 11    0.73    0.47    1.00    0.78   0.00    0.00    1.00
## pres         45 11    0.64    0.50    1.00    0.67   0.00    0.00    1.00
## lyp          46 11    7.22    0.81    7.02    7.15   0.88    6.27    8.80
## semi         47 11    0.18    0.40    0.00    0.11   0.00    0.00    1.00
## majpar       48 11    0.18    0.40    0.00    0.11   0.00    0.00    1.00
## majpres      49 11    0.55    0.52    1.00    0.56   0.00    0.00    1.00
## propres      50 11    0.09    0.30    0.00    0.00   0.00    0.00    1.00
## dem_age      51 11 1975.82   24.77 1989.00 1981.11   7.41 1910.00 1994.00
## lat01        52 11    0.17    0.08    0.18    0.17   0.05    0.00    0.32
## age          53 11    0.12    0.12    0.05    0.09   0.04    0.03    0.45
## polityIV     54 11    2.34    5.56    0.22    2.42   6.75   -6.00   10.00
## spl          55  8   -1.55    4.52   -1.54   -1.55   1.91   -6.77    8.23
## cpi9500      56  9    5.70    1.15    5.90    5.70   1.14    3.93    7.55
## du_60ctry    57 11    0.27    0.47    0.00    0.22   0.00    0.00    1.00
## magn         58 11    0.71    0.41    1.00    0.75   0.00    0.02    1.00
## sdm          59  9    0.71    0.45    1.00    0.71   0.00    0.03    1.00
## oecd.x       60 11    0.00    0.00    0.00    0.00   0.00    0.00    0.00
## mining_gdp   61 10    8.43   11.70    4.10    5.89   5.71    0.02   37.20
## gini_8090    62  9   50.25    9.95   54.00   50.25  11.86   35.36   62.30
## con2150      63 11    0.00    0.00    0.00    0.00   0.00    0.00    0.00
## con5180      64 11    0.27    0.47    0.00    0.22   0.00    0.00    1.00
## con81        65 11    0.73    0.47    1.00    0.78   0.00    0.00    1.00
## list         66 11   49.83  119.87    0.00   16.46   0.00    0.00  400.00
## maj_bad      67 11    2.73    2.05    3.83    2.80   1.56    0.00    4.89
## maj_gin      68  9   37.31   22.84   41.35   37.31  18.75    0.00   62.00
## maj_old      69 11    0.06    0.07    0.04    0.06   0.06    0.00    0.17
## pres_bad     70 11    2.63    2.18    3.83    2.67   1.56    0.00    4.89
## pres_gin     71  9   26.72   26.59   35.36   26.72  39.50    0.00   62.00
## pres_old     72 11    0.04    0.05    0.03    0.03   0.04    0.00    0.17
## propar       73 11    0.18    0.40    0.00    0.11   0.00    0.00    1.00
## lpop         74  3   13.99    0.15   13.92   13.99   0.05   13.88   14.17
## continent*   75 11    1.00    0.00    1.00    1.00   0.00    1.00    1.00
##              range  skew kurtosis     se
## oecd          0.00   NaN      NaN   0.00
## country*     10.00  0.00    -1.53   1.00
## pind          1.00 -1.06    -0.79   0.13
## pindo         1.00 -1.06    -0.79   0.13
## ctrycd      555.00 -2.13     3.44  46.70
## col_uk        1.00 -1.43     0.08   0.12
## t_indep      80.00  1.38     1.88   5.96
## col_uka       0.92 -1.31    -0.14   0.10
## col_espa      0.00   NaN      NaN   0.00
## col_otha      0.96  1.58     0.79   0.10
## legor_uk      1.00 -1.43     0.08   0.12
## legor_so      0.00   NaN      NaN   0.00
## legor_fr      1.00  1.43     0.08   0.12
## legor_ge      0.00   NaN      NaN   0.00
## legor_sc      0.00   NaN      NaN   0.00
## prot80       64.10  0.46    -0.80   6.10
## catho80      47.70  0.71    -0.39   4.12
## confu         0.00   NaN      NaN   0.00
## avelf         0.46 -1.15     0.44   0.04
## govef         2.70  0.97    -0.17   0.25
## graft         2.30 -0.17    -1.62   0.23
## logyl         2.18  0.42    -1.43   0.23
## loga          2.29  0.03    -0.91   0.20
## yrsopen       1.00  1.72     2.15   0.09
## gadp          0.37  0.28    -1.38   0.04
## engfrac       0.09  0.95    -1.09   0.01
## eurfrac       0.57  2.24     3.76   0.05
## frankrom      1.77  0.54    -0.69   0.15
## latitude     43.90  0.44    -1.52   4.57
## gastil        3.28 -0.48    -1.45   0.35
## cgexp        23.92  0.06    -1.30   2.41
## cgrev        33.61  1.40     0.71   3.45
## ssw           3.36  0.52    -1.87   0.60
## rgdph      6136.54  1.50     1.28 552.55
## trade        97.29  0.31    -1.40   9.69
## prop1564     16.90  1.19     0.34   1.48
## prop65        3.92  1.47     1.16   0.35
## federal       0.00   NaN      NaN   0.00
## eduger       70.62  0.08    -1.50   7.10
## spropn        1.00  0.92    -1.07   0.13
## yearele      29.00 -0.41    -2.00   4.77
## yearreg      29.00 -0.41    -2.00   4.77
## seats       362.67  0.92    -0.20  33.16
## maj           1.00 -0.88    -1.31   0.14
## pres          1.00 -0.49    -1.91   0.15
## lyp           2.53  0.53    -1.18   0.25
## semi          1.00  1.43     0.08   0.12
## majpar        1.00  1.43     0.08   0.12
## majpres       1.00 -0.16    -2.15   0.16
## propres       1.00  2.47     4.52   0.09
## dem_age      84.00 -1.57     1.64   7.47
## lat01         0.32 -0.28    -0.38   0.03
## age           0.42  1.57     1.64   0.04
## polityIV     16.00  0.07    -1.63   1.68
## spl          15.00  0.98     0.05   1.60
## cpi9500       3.61  0.01    -1.45   0.38
## du_60ctry     1.00  0.88    -1.31   0.14
## magn          0.98 -0.58    -1.70   0.12
## sdm           0.97 -0.67    -1.63   0.15
## oecd.x        0.00   NaN      NaN   0.00
## mining_gdp   37.18  1.39     0.79   3.70
## gini_8090    26.94 -0.19    -1.71   3.32
## con2150       0.00   NaN      NaN   0.00
## con5180       1.00  0.88    -1.31   0.14
## con81         1.00 -0.88    -1.31   0.14
## list        400.00  2.22     3.64  36.14
## maj_bad       4.89 -0.32    -1.81   0.62
## maj_gin      62.00 -0.71    -1.17   7.61
## maj_old       0.17  0.69    -1.39   0.02
## pres_bad      4.89 -0.30    -1.92   0.66
## pres_gin     62.00  0.04    -1.97   8.86
## pres_old      0.17  1.66     2.10   0.02
## propar        1.00  1.43     0.08   0.12
## lpop          0.28  0.36    -2.33   0.09
## continent*    0.00   NaN      NaN   0.00
# Group-level summary statistics can be assigned to their own object for easy retrieval
africa_summary<-summary_stats_by_continent[["africa"]]
# Generate a table that displays summary statistics (mean, standard deviation, and number of observations) for the "trade" and "age" variables at the continent level and assign to object named "trade_age_by_continent"
trade_age_by_continent<-pt_copy %>% group_by(continent) %>% 
                                    summarise(meanTrade=mean(trade),sdTrade=sd(trade),
                                              meanAge=mean(age), sdAge=sd(age),
                                              n=n())
# prints contents of "trade_age_by_continent"
trade_age_by_continent
## # A tibble: 4 × 6
##   continent meanTrade sdTrade meanAge  sdAge     n
##   <chr>         <dbl>   <dbl>   <dbl>  <dbl> <int>
## 1 africa         77.3    32.1   0.121 0.124     11
## 2 asiae          97.8    84.6   0.110 0.0846    13
## 3 laam           68.6    32.8   0.139 0.153     23
## 4 other          78.8    40.7   0.309 0.263     38
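The grouped summary above works because "trade" and "age" have no missing values. For a variable with missing values (such as "cgexp", which has fewer observations than "trade" in the Africa summary statistics), mean() and sd() return NA unless na.rm=TRUE is supplied. The following is a minimal sketch on toy data; the object names "toy" and "toy_summary" and all values are hypothetical, and the "dplyr" package is assumed to be installed.

```r
library(dplyr)
# Toy data frame (hypothetical values) with one missing "cgexp" value
toy<-tibble(continent=c("africa", "africa", "other", "other"),
            cgexp=c(20, NA, 30, 50))
# Computes the group-level mean of "cgexp" with and without removing missing values
toy_summary<-toy %>% group_by(continent) %>% 
                     summarise(meanCgexp_raw=mean(cgexp),
                               meanCgexp=mean(cgexp, na.rm=TRUE),
                               n=n())
# Prints contents of "toy_summary"
toy_summary
```

Note that n() counts all rows in each group, including those with missing values.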
# Creates cross-tab showing the breakdown of federal/non federal across continents
crosstab_federal_continent<-pt_copy %>% tabyl(federal, continent)

4.2.3 Basic Data Cleaning and Preparation Tasks

Rearranging Columns

# bring the "country" column to the front of the dataset
pt_copy<-pt_copy %>% relocate(country)
# bring the "country", "list", "trade", "oecd" columns to the front of the dataset
pt_copy<-pt_copy %>% relocate(country, list, trade, oecd)

Renaming variables

# Renames the "list" variable to "party_list"
pt_copy<-pt_copy %>% rename(party_list=list)

Sorting a dataset in ascending or descending order with respect to a variable

# sorting in ascending (low to high) order with respect to the "trade" variable
pt_copy<-pt_copy %>% arrange(trade)
# sorting in descending (high to low) order with respect to the "trade" variable
pt_copy<-pt_copy %>% arrange(desc(trade))

Creating new variables based on existing variables

# Creates a new variable named "non_catholic_80" that is calculated by subtracting the Catholic share of the population in 1980 ("catho80") from 100, and relocates "country", "catho80", and the newly created "non_catholic_80" to the front of the dataset
pt_copy<-pt_copy %>% mutate(non_catholic_80=100-catho80) %>% 
                     relocate(country, catho80, non_catholic_80)

Selecting or Deleting Variables

# Selects "country", "cgexp", "cgrev", "trade", and "federal" variables from the "pt_copy" dataset and assigns the selection to a new object named "pt_copy_selection"
pt_copy_selection<-pt_copy %>% select(country, cgexp, cgrev, trade, federal)
# prints "pt_copy_selection" without the "federal" variable; because the result is not assigned, "pt_copy_selection" itself is unchanged
pt_copy_selection %>% select(-federal)
## # A tibble: 85 × 4
##    country       cgexp cgrev trade
##    <chr>         <dbl> <dbl> <dbl>
##  1 Singapore      18.5  34.7  343.
##  2 Malta          41.0  35.0  190.
##  3 Luxembourg     40.2  45.5  189.
##  4 Malaysia       24.5  26.8  176.
##  5 Estonia        30.0  31.1  154.
##  6 Belgium        47.9  43.7  132.
##  7 Ireland        38.1  34.8  129.
##  8 Mauritius      22.5  21.6  128.
##  9 St. Vincent&G  34.8  28.7  123.
## 10 Jamaica        NA    NA    122.
## # … with 75 more rows
# deletes "federal" and "trade" from "pt_copy_selection" and assigns it to new object named "pt_copy_selection_modified"
pt_copy_selection_modified<-pt_copy_selection %>% select(-c(federal, trade))

Recoding Variables

Creating Dummy Variables from Continuous Numeric Variables

# Creates a new dummy variable based on the existing "trade" variable named "trade_open" (which takes on a value of "1" if "trade" is greater than or equal to 77, and 0 otherwise) and then moves the newly created variable to the front of the dataset along with "country" and "trade"; all changes are assigned to "pt_copy", thereby overwriting the existing version of "pt_copy"

pt_copy<-pt_copy %>% mutate(trade_open=ifelse(trade>=77, 1, 0)) %>% 
                     relocate(country, trade_open, trade)

Creating categorical variables from continuous numeric variables

# Creates a new variable in the "pt_copy" dataset named "trade_level" (that is coded as "Low_Trade" when the "trade" variable is greater than 15 and less than 50, coded as "Intermediate_Trade" when "trade" is greater than or equal to 50 and less than 100, and coded as "High_Trade" when "trade" is greater than or equal to 100; observations that meet none of these conditions are coded as NA), and then reorders the dataset such that "country", "trade_level", and "trade" are the first three variables in the dataset
pt_copy<-pt_copy %>% mutate(trade_level=case_when(trade>15 & trade<50~"Low_Trade",
                                                  trade>=50 & trade<100~"Intermediate_Trade",
                                                  trade>=100~"High_Trade")) %>% 
                    relocate(country, trade_level, trade)
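The conditions in the case_when() call above do not cover values of "trade" that are 15 or below; case_when() returns NA for any value that matches none of its conditions. The following minimal sketch shows this behavior on a toy vector; the object names "toy_trade" and "toy_level" and the values are hypothetical, and the "dplyr" package is assumed to be installed.

```r
library(dplyr)
# Toy vector of trade values (hypothetical); the first value falls below all of the ranges
toy_trade<-c(10, 30, 75, 120)
# Applies the same conditions used to create "trade_level"
toy_level<-case_when(toy_trade>15 & toy_trade<50~"Low_Trade",
                     toy_trade>=50 & toy_trade<100~"Intermediate_Trade",
                     toy_trade>=100~"High_Trade")
# Prints contents of "toy_level"; the first element is NA because 10 meets none of the conditions
toy_level
```

If a catch-all category is preferred to NA, a final condition such as TRUE~"Other" (a hypothetical label) can be added as the last argument to case_when().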

Creating Dummy Variables from Categorical Variables

# Creates dummy variables from "trade_level" column, and relocates the new dummies to the front of the dataset
pt_copy<-pt_copy %>% dummy_cols("trade_level") %>% 
                      relocate(country, trade_level, trade_level_High_Trade, trade_level_Intermediate_Trade, trade_level_Low_Trade)

Subsetting Observations

# Extracts OECD observations in "pt_copy" and assigns to object named "oecd_countries"
oecd_countries<-pt_copy %>% filter(oecd==1) %>% 
                            relocate(country, oecd)
# Extracts observations for which cgrev (central government revenue as % of gdp)>40, and assigns to object named "high_revenues"
high_revenues<-pt_copy %>% filter(cgrev>40) %>% 
                              relocate(country, cgrev)
# Extracts observations for which the "catho80" variable is less than or equal to 50
minority_catholic<-pt_copy %>% filter(catho80<=50) %>% 
                               relocate(country, catho80)
# Extracts federal OECD countries (where oecd=1 AND federal=1) and assigns to a new object named "oecd_federal_countries"
oecd_federal_countries<-pt_copy %>% filter(oecd==1 & federal==1) %>% 
                                      relocate(country, oecd, federal)
# Extracts observations that are in Africa ("africa") OR in Asia/Europe ("asiae") and assigns to an object named "asia_europe_africa"
asia_europe_africa<-pt_copy %>% filter(continent=="africa"|continent=="asiae") %>% 
                                  relocate(continent)
# Displays contents of "asia_europe_africa" as an interactive table
asia_europe_africa %>% datatable(extensions=c("Scroller", "FixedColumns"), options = list(
  deferRender = TRUE,
  scrollY = 350,
  scrollX = 350,
  dom = "t",
  scroller = TRUE,
  fixedColumns = list(leftColumns = 3)
))

Filtering for observations that do NOT meet a condition:

# Extracts all non-Africa observations and assigns to object named "pt_copy_sans_africa"
pt_copy_sans_africa<-pt_copy %>% filter(continent!="africa") %>% relocate(continent)
# Displays contents of "pt_copy_sans_africa" as an interactive table
pt_copy_sans_africa %>% datatable(extensions=c("Scroller", "FixedColumns"), options = list(
  deferRender = TRUE,
  scrollY = 350,
  scrollX = 350,
  dom = "t",
  scroller = TRUE,
  fixedColumns = list(leftColumns = 3)
))

4.2.4 Exploratory visualization using ggplot2

Bar Charts

# filters Africa observations
pt_africa<-pt_copy %>% 
            filter(continent=="africa")
# Creates a bar chart of the "cgexp" variable (central government expenditure as a share of GDP) for the Africa observations and assigns the plot to an object named "cgexp_africa"
cgexp_africa<-pt_africa %>% 
  drop_na(cgexp) %>% 
  ggplot()+
  geom_col(aes(x=country, y=cgexp))+
  labs(
    title="Central Govt Expenditure as Pct of GDP for Select African Countries (1990-1998 Average)",
    x="Country Name", 
    y="CGEXP")+
  theme(plot.title=element_text(hjust=0.5),
        axis.text.x = element_text(angle = 90))
# prints contents of cgexp_africa
cgexp_africa

# Creates a bar chart of the "cgexp" variable (central government expenditure as a share of GDP) for the Africa observations; countries are on the x axis and arrayed in ascending order with respect to the cgexp variable, which is on the y-axis; plot is assigned to an object named "cgexp_africa_ascending"
cgexp_africa_ascending<-
  pt_africa %>% 
  drop_na(cgexp) %>% 
  ggplot()+
  geom_col(aes(x=reorder(country, cgexp), y=cgexp))+
  labs(
    title="Central Govt Expenditure as Pct of GDP for Select African Countries (1990-1998 Average)",
    x="Country Name", 
    y="CGEXP")+
  theme(plot.title=element_text(hjust=0.5),
        axis.text.x = element_text(angle = 90))
# Creates a bar chart of the "cgexp" variable (central government expenditure as a share of GDP) for the Africa observations; countries are on the x axis and arrayed in descending order with respect to the cgexp variable, which is on the y-axis; plot is assigned to an object named "cgexp_africa_descending"
cgexp_africa_descending<-
  pt_africa %>% 
  drop_na(cgexp) %>% 
  ggplot()+
  geom_col(aes(x=reorder(country, -cgexp), y=cgexp))+
  labs(
    title="Central Govt Expenditure as Pct of GDP for Select African Countries (1990-1998 Average)",
    x="Country Name", 
    y="CGEXP")+
  theme(plot.title=element_text(hjust=0.5),
        axis.text.x = element_text(angle = 90))
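# inverts the axes of "cgexp_africa_ascending" (so that countries are on the y-axis) and assigns the result to a new object named "cgexp_africa_ascending_inverted"
cgexp_africa_ascending_inverted<-cgexp_africa_ascending+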
                                    coord_flip()
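The ordering of the bars in the charts above comes from reorder(), which turns its first argument into a factor whose levels are sorted by the values of its second argument; ggplot2 then draws the bars in level order. The following minimal sketch shows the resulting level order on toy data, using only base R; the object names "toy_country" and "toy_cgexp" and the values are hypothetical.

```r
# Toy country labels and expenditure values (hypothetical)
toy_country<-c("A", "B", "C")
toy_cgexp<-c(30, 10, 20)
# Levels sorted in ascending order of "toy_cgexp"
ascending_levels<-levels(reorder(toy_country, toy_cgexp))
# Levels sorted in descending order of "toy_cgexp" (note the minus sign)
descending_levels<-levels(reorder(toy_country, -toy_cgexp))
# Prints both orderings
ascending_levels
descending_levels
```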

Scatterplots

# Creates scatterplot with "cgexp" variable on x-axis and "trade" variable on y-axis and assigns to object named "scatter_cgexp_trade"
scatter_cgexp_trade<-
  pt_copy %>% 
  drop_na(cgexp) %>% 
  ggplot()+
  geom_point(aes(x=cgexp, y=trade))+
  labs(title="Trade Share of GDP \nas a function of\n Central Govt Expenditure (1990-1998 Average) ", 
       x="Central Government Expenditure (Pct of GDP)", y="Overall Trade (Pct of GDP)")+
  theme(plot.title=element_text(hjust=0.5)) 
# prints contents of "scatter_cgexp_trade"
scatter_cgexp_trade

# Creates scatterplot with "cgexp" variable on x-axis and "trade" variable on y-axis, and uses different color points for different continents; plot is assigned to object named "scatter_cgexp_trade_grouped"
scatter_cgexp_trade_grouped<-
  pt_copy %>% 
  drop_na(cgexp) %>% 
  ggplot()+
  geom_point(aes(x=cgexp, y=trade, color=continent))+
  labs(title="Trade Share of GDP \nas a function of\n Central Govt Expenditure (1990-1998 Average) ", 
       x="Central Government Expenditure (Pct of GDP)", y="Overall Trade (Pct of GDP)")+
  theme(plot.title=element_text(hjust=0.5)) 
# prints contents of "scatter_cgexp_trade_grouped"
scatter_cgexp_trade_grouped

# Creates continent-level subplots for scatterplot, using facets; assigns plot to new object named "scatter_cgexp_trade_facets"
scatter_cgexp_trade_facets<-
  ggplot(data = pt_copy) + 
  geom_point(mapping = aes(x = cgexp, y = trade)) + 
  facet_wrap(~ continent, nrow = 2)
# prints contents of "scatter_cgexp_trade_facets"
scatter_cgexp_trade_facets
## Warning: Removed 3 rows containing missing values (geom_point).

# Creates scatterplot with "cgexp" variable on x-axis and "trade" variable on y-axis, and adds line of best fit; plot is assigned to object named "scatter_cgexp_trade_line"
scatter_cgexp_trade_line<-
  pt_copy %>% 
  drop_na(cgexp) %>% 
  ggplot()+
  geom_point(aes(x=cgexp, y=trade))+
  geom_smooth(aes(x=cgexp, y=trade), method="lm")+
  labs(title="Trade Share of GDP \nas a function of\n Central Govt Expenditure (1990-1998 Average) ", 
       x="Central Government Expenditure (Pct of GDP)", y="Overall Trade (Pct of GDP)")+
  theme(plot.title=element_text(hjust=0.5)) 
# Prints contents of "scatter_cgexp_trade_line"
scatter_cgexp_trade_line
## `geom_smooth()` using formula 'y ~ x'

4.3 Text Data Processing, Manipulation, and Visualization

# prints ws_df_appended
ws_df_appended
## # A tibble: 12,686 × 70
##    `Publication Type` Authors     `Book Authors` `Book Editors` `Book Group Au…`
##    <chr>              <chr>       <chr>          <chr>          <chr>           
##  1 J                  Miles, M    <NA>           <NA>           <NA>            
##  2 J                  Dal Farra,… <NA>           <NA>           <NA>            
##  3 J                  Chen, MH    <NA>           <NA>           <NA>            
##  4 J                  Guy, S; He… <NA>           <NA>           <NA>            
##  5 J                  Baztan, J;… <NA>           <NA>           <NA>            
##  6 J                  Burke, M; … <NA>           <NA>           <NA>            
##  7 J                  Rodder, S   <NA>           <NA>           <NA>            
##  8 J                  Bentz, J; … <NA>           <NA>           <NA>            
##  9 J                  Ture, C     <NA>           <NA>           <NA>            
## 10 J                  Kim, S      <NA>           <NA>           <NA>            
## # … with 12,676 more rows, and 65 more variables: `Author Full Names` <chr>,
## #   `Book Author Full Names` <chr>, `Group Authors` <chr>,
## #   `Article Title` <chr>, `Source Title` <chr>, `Book Series Title` <chr>,
## #   `Book Series Subtitle` <lgl>, Language <chr>, `Document Type` <chr>,
## #   `Conference Title` <chr>, `Conference Date` <chr>,
## #   `Conference Location` <chr>, `Conference Sponsor` <chr>,
## #   `Conference Host` <chr>, `Author Keywords` <chr>, `Keywords Plus` <chr>, …
# selects "Abstract" column from "ws_df_appended" and assigns to new object named "wos_abstracts"
wos_abstracts<-ws_df_appended %>% select(Abstract)
# Tokenizes "Abstract" column text by word; assigns tokenized dataset (with words in "word" column) to a new object named "wos_abstracts_tokenized"
wos_abstracts_tokenized<-wos_abstracts %>% 
                          unnest_tokens(input=Abstract,
                                        token="words",
                                        output=word)
# generates frequency table from "wos_abstracts_tokenized" (sorted from most to least frequent word), and assigns it to a new object named "wos_abstracts_frequency"
wos_abstracts_frequency<-wos_abstracts_tokenized %>% 
                          count(word, sort=TRUE)
# prints "stop_words" (part of the "tidytext" package)
stop_words
## # A tibble: 1,149 × 2
##    word        lexicon
##    <chr>       <chr>  
##  1 a           SMART  
##  2 a's         SMART  
##  3 able        SMART  
##  4 about       SMART  
##  5 above       SMART  
##  6 according   SMART  
##  7 accordingly SMART  
##  8 across      SMART  
##  9 actually    SMART  
## 10 after       SMART  
## # … with 1,139 more rows
# cleans "wos_abstracts_frequency" by removing stop words and removing numbers
wos_abstracts_frequency_cleaned<-wos_abstracts_frequency %>% 
                                    filter(!word %in% stop_words$word) %>% 
                                    filter(!grepl('[0-9]', word))   
# creates a new data frame that consists of the rows with the ten highest values for "n" (i.e. the ten most frequently recurring words) and assigns it to a new object named "wos_top_ten"
wos_top_ten<-wos_abstracts_frequency_cleaned %>% 
              slice_max(n, n=10)
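By default, slice_max() keeps ties, so the result can contain more rows than requested when several words share the cutoff frequency. The following minimal sketch shows this on a toy frequency table; the object names and values are hypothetical, and the "dplyr" package is assumed to be installed.

```r
library(dplyr)
# Toy frequency table (hypothetical values) in which two words are tied with a count of 3
toy_counts<-tibble(word=c("climate", "art", "change", "ice"),
                   n=c(5, 3, 3, 1))
# Requests the top two rows; ties are kept by default, so three rows are returned
toy_top_with_ties<-toy_counts %>% 
                     slice_max(n, n=2)
# Requests the top two rows and drops ties, so exactly two rows are returned
toy_top_no_ties<-toy_counts %>% 
                   slice_max(n, n=2, with_ties=FALSE)
# Prints both results
toy_top_with_ties
toy_top_no_ties
```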
# creates bar chart of word frequencies based on "wos_top_ten" and assigns to new object named "wos_frequency_graph"
wos_frequency_graph<-
  ggplot(data=wos_top_ten)+
    geom_col(aes(x=word, y=n))+
    labs(title="Ten Most Frequent Words in Abstracts of Publications on Climate + Art",
         caption = "Source: Web of Science", 
         x="", 
         y="Frequency")
# creates bar chart of word frequency data in "wos_top_ten" where words (on the x-axis) are arrayed in ascending order of frequency, and frequency (n) is represented on the Y axis; modified graph is assigned back to "wos_frequency_graph"
wos_frequency_graph<-
  ggplot(data=wos_top_ten)+
   geom_col(aes(x=reorder(word, n), y=n))+
    labs(title="Ten Most Frequent Words in Abstracts of Publications on Climate + Art",
         caption = "Source: Web of Science", 
         x="", 
         y="Frequency")
# inverts axes of "wos_frequency_graph" and assigns the result to a new object named "wos_frequency_graph_inverted"
wos_frequency_graph_inverted<-
   wos_frequency_graph+
    coord_flip()

Word Cloud

# make word cloud based on word frequency information from "wos_abstracts_frequency_cleaned"
wordcloud2(data = wos_abstracts_frequency_cleaned, minRotation = 0, maxRotation = 0, ellipticity = 0.6)

4.4 Automating Data Processing Tasks

# write function to take input WOS dataset, select the "Authors", "Article Title", "Source Title", and "Language" columns, rename "Article Title" column to "Article" and rename "Source Title" column to "Source", and then subset English language papers; the function is assigned to an object named "wos_clean_function"
wos_clean_function<-function(input_dataset){
  modified_dataset<-input_dataset %>% 
                      select(Authors, "Article Title", "Source Title", Language) %>% 
                      rename("Article"="Article Title",
                             "Source"="Source Title") %>% 
                      filter(Language=="English")
  return(modified_dataset)
}
# apply "wos_clean_function" to all list elements in "wos_file_list" and assign the new list of modified data frames to a new object named "processed_wos_list"
processed_wos_list<-map(wos_file_list, wos_clean_function)
# print contents of "processed_wos_list"
processed_wos_list
## [[1]]
## # A tibble: 964 × 4
##    Authors                                               Article Source Language
##    <chr>                                                 <chr>   <chr>  <chr>   
##  1 Miles, M                                              Repres… CULTU… English 
##  2 Dal Farra, R; Suarez, P                               RED CR… LEONA… English 
##  3 Guy, S; Henshaw, V; Heidrich, O                       Climat… JOURN… English 
##  4 Baztan, J; Vanderlinden, JP; Jaffres, L; Jorgensen, … Facing… CLIMA… English 
##  5 Burke, M; Tickwell, D; Whitmarsh, L                   Partic… GLOBA… English 
##  6 Rodder, S                                             The Cl… MINER… English 
##  7 Bentz, J; O'Brien, K                                  ART FO… ELEME… English 
##  8 Kim, S                                                Art th… ARTS … English 
##  9 Gabrys, J; Yusoff, K                                  Arts, … SCIEN… English 
## 10 Taplin, R                                             CONTEM… LEONA… English 
## # … with 954 more rows
## 
## [[2]]
## # A tibble: 978 × 4
##    Authors                                               Article Source Language
##    <chr>                                                 <chr>   <chr>  <chr>   
##  1 De Ollas, C; Morillon, R; Fotopoulos, V; Puertolas, … Facing… FRONT… English 
##  2 Mansfield, LA; Nowack, PJ; Kasoar, M; Everitt, RG; C… Predic… NPJ C… English 
##  3 Forrest, M                                            A Refl… PAIDE… English 
##  4 Chiodo, G; Garcia-Herrera, R; Calvo, N; Vaquero, JM;… The im… ENVIR… English 
##  5 Williams, KD; Jones, A; Roberts, DL; Senior, CA; Woo… The re… CLIMA… English 
##  6 Mascaro, G; Viola, F; Deidda, R                       Evalua… JOURN… English 
##  7 Anel, JA                                              High P… IBERG… English 
##  8 Crosato, A; Grissetti-Vazquez, A; Bregoli, F; Franca… Adapta… JOURN… English 
##  9 Barnett, TP; Pierce, DW; Schnur, R                    Detect… SCIEN… English 
## 10 Buckland, D; Lertzman, R                              David … ENVIR… English 
## # … with 968 more rows
## 
## [[3]]
## # A tibble: 969 × 4
##    Authors                                               Article Source Language
##    <chr>                                                 <chr>   <chr>  <chr>   
##  1 Goswami, BB; Khouider, B; Phani, R; Mukhopadhyay, P;… Implem… JOURN… English 
##  2 Berman, AL; Silvestri, GE; Tonello, MS                On the… QUATE… English 
##  3 Kulmala, M; Asmi, A; Lappalainen, HK; Carslaw, KS; P… Introd… ATMOS… English 
##  4 Touchan, R; Anchukaitis, KJ; Meko, DM; Sabir, M; Att… Spatio… CLIMA… English 
##  5 Zhang, L; Han, WQ; Hu, ZZ                             Interb… JOURN… English 
##  6 Fallah, A; Sungmin, O; Orth, R                        Climat… HYDRO… English 
##  7 Wagner, TJW; Eisenman, I                              How cl… GEOPH… English 
##  8 Bloschl, G; Ardoin-Bardin, S; Bonell, M; Dorninger, … UNESCO… CLIMA… English 
##  9 Hoang, T; Pulliat, G                                  Green … URBAN… English 
## 10 Sanz, T; Rodriguez-Labajos, B                         Does a… GEOFO… English 
## # … with 959 more rows
## 
## [[4]]
## # A tibble: 974 × 4
##    Authors                                               Article Source Language
##    <chr>                                                 <chr>   <chr>  <chr>   
##  1 Arce-Nazario, JA                                      Transl… JOURN… English 
##  2 Beevers, L; Popescu, I; Pregnolato, M; Liu, YX; Wrig… Identi… FRONT… English 
##  3 Staehelin, J; Tummon, F; Revell, L; Stenke, A; Peter… Tropos… ATMOS… English 
##  4 Zhang, YH; Seidel, DJ; Golaz, JC; Deser, C; Tomas, RA Climat… JOURN… English 
##  5 Sudantha, BH; Warusavitharana, EJ; Ratnayake, GR; Ma… Buildi… 2018 … English 
##  6 Franzen, C                                            Shelte… ENVIR… English 
##  7 van Oldenborgh, GJ; Drijfhout, S; van Ulden, A; Haar… Wester… CLIMA… English 
##  8 Tankersley, MS; Ledford, DK                           Stingi… JOURN… English 
##  9 Pistocchi, A; Sarigiannis, DA; Vizcaino, P            Spatia… SCIEN… English 
## 10 Biggs, HR; Desjardins, A                              Crafti… PROCE… English 
## # … with 964 more rows
## 
## [[5]]
## # A tibble: 964 × 4
##    Authors                                               Article Source Language
##    <chr>                                                 <chr>   <chr>  <chr>   
##  1 Esper, J; Klippel, L; Krusic, PJ; Konter, O; Raible,… Easter… CLIMA… English 
##  2 Hallgren, AM                                          (Un)st… KONST… English 
##  3 Caron, LP; Hermanson, L; Dobbin, A; Imbers, J; Lledo… How Sk… BULLE… English 
##  4 Sharif, K; Gormley, M                                 Integr… WATER  English 
##  5 Santer, BD; Wigley, TML; Gaffen, DJ; Bengtsson, L; D… Interp… SCIEN… English 
##  6 Sarospataki, M; Szabo, P; Fekete, A                   Future… LAND   English 
##  7 Pente, P                                              Slow M… INTER… English 
##  8 Marsh, C                                              Tagore… ASIAT… English 
##  9 Wang, XY; Dai, ZG; Zhang, EH; Fuyang, KE; Cao, YC; S… Tropos… ADVAN… English 
## 10 Servera-Vives, G; Riera, S; Picornell-Gelabert, L; M… The on… PALAE… English 
## # … with 954 more rows
## 
## [[6]]
## # A tibble: 974 × 4
##    Authors                                               Article Source Language
##    <chr>                                                 <chr>   <chr>  <chr>   
##  1 Kim, JK                                               Novel … MATER… English 
##  2 Singh, SJ; Fischer-Kowalski, M; Chertow, M            Introd… SUSTA… English 
##  3 D'Andrea, F; Provenzale, A; Vautard, R; De Noblet-Du… Hot an… GEOPH… English 
##  4 Grant, KM; Rohling, EJ; Bar-Matthews, M; Ayalon, A; … Rapid … NATURE English 
##  5 Guan, B; Waliser, DE; Ralph, FM                       A mult… ANNAL… English 
##  6 Hummel, M; Hoose, C; Pummer, B; Schaupp, C; Frohlich… Simula… ATMOS… English 
##  7 Rodney, L                                             Road S… SPACE… English 
##  8 Jouili, JS                                            Islam … COMPA… English 
##  9 Rodriguez-Labajos, B                                  Artist… CURRE… English 
## 10 Joyette, ART; Nurse, LA; Pulwarty, RS                 Disast… DISAS… English 
## # … with 964 more rows
## 
## [[7]]
## # A tibble: 976 × 4
##    Authors                                               Article Source Language
##    <chr>                                                 <chr>   <chr>  <chr>   
##  1 Farnoosh, A; Azari, B; Ostadabbas, S                  Deep S… THIRT… English 
##  2 Nahhas, TM; Kohl, H                                   Tradit… ARABI… English 
##  3 Yoon, J; Bae, S                                       Perfor… SUSTA… English 
##  4 Masnavi, MR; Gharai, F; Hajibandeh, M                 Explor… INTER… English 
##  5 Hu, ZZ; Kumar, A; Jha, B; Zhu, JS; Huang, BH          Persis… JOURN… English 
##  6 Gonzalez, FR; Raval, S; Taplin, R; Timms, W; Hitch, M Evalua… NATUR… English 
##  7 Hungilo, GG; Emmanuel, G; Emanuel, AWR                Image … 2019 … English 
##  8 Fernandes, LL; Lee, ES; McNeil, A; Jonsson, JC; Noui… Angula… ENERG… English 
##  9 Ingwersen, W; Gausman, M; Weisbrod, A; Sengupta, D; … Detail… JOURN… English 
## 10 Zilitinkevich, SS; Tyuryakov, SA; Troitskaya, YI; Ma… Theore… IZVES… English 
## # … with 966 more rows
## 
## [[8]]
## # A tibble: 977 × 4
##    Authors                                               Article Source Language
##    <chr>                                                 <chr>   <chr>  <chr>   
##  1 Chalal, ML; Benachir, M; White, M; Shrahily, R        Energy… RENEW… English 
##  2 El-Araby, E; Taher, M; El-Ghazawi, T; Le Moigne, J    Protot… FPT 0… English 
##  3 Silva, B; Prieto, B; Rivas, T; Sanchez-Biezma, MJ; P… Rapid … INTER… English 
##  4 Sorbet, J; Fernandez-Peruchena, C; Zaversky, F; Chak… Perfor… JOURN… English 
##  5 Smart, PDS; Thanammal, KK; Sujatha, SS                A nove… SADHA… English 
##  6 Colding, J; Wallhagen, M; Sorqyist, P; Marcus, L; Hi… Applyi… SMART… English 
##  7 Antonini, E; Vodola, V; Gaspari, J; De Giglio, M      Outdoo… ENERG… English 
##  8 Woodworth, PL                                         Differ… JOURN… English 
##  9 Roh, JS; Kim, S                                       All-fa… JOURN… English 
## 10 Loch, CH; Terwiesch, C                                Accele… JOURN… English 
## # … with 967 more rows
## 
## [[9]]
## # A tibble: 975 × 4
##    Authors                                               Article Source Language
##    <chr>                                                 <chr>   <chr>  <chr>   
##  1 Rajer, A; Heard, C                                    Pink p… CONSE… English 
##  2 Hashioka, T; Vogt, M; Yamanaka, Y; Le Quere, C; Buit… Phytop… BIOGE… English 
##  3 Halova, P; Kroupova, ZZ; Havlikova, M; Cechura, L; M… Provis… AGRAR… English 
##  4 MARTI, C; BADIA, D                                    CHARAC… ARID … English 
##  5 Taveres-Cachat, E; Grynning, S; Almas, O; Goia, F     Advanc… 11TH … English 
##  6 Haida, M; Palacz, M; Bodys, J; Smolka, J; Gullo, P; … An exp… APPLI… English 
##  7 Schmid, R                                             Pocket… SAGE … English 
##  8 Manzini, E; Cagnazzo, C; Fogli, PG; Bellucci, A; Mul… Strato… GEOPH… English 
##  9 Chang, WL; Griffin, RJ; Dabdub, D                     Partit… PROCE… English 
## 10 Bruhwiler, PA; Buyan, M; Huber, R; Bogerd, CP; Sznit… Heat t… JOURN… English 
## # … with 965 more rows
## 
## [[10]]
## # A tibble: 976 × 4
##    Authors                                               Article Source Language
##    <chr>                                                 <chr>   <chr>  <chr>   
##  1 Paietta, E                                            Commen… BEST … English 
##  2 Chen, N; Paek, SY; Lee, JY; Park, JH; Lee, SY; Lee, … High-p… ENERG… English 
##  3 Bruckert, J; Hoshyaripour, GA; Horvath, A; Muser, LO… Online… ATMOS… English 
##  4 Cha, MS; Park, JE; Kim, S; Han, SH; Shin, SH; Yang, … Poly(c… ENERG… English 
##  5 Brewin, RJW; Sathyendranath, S; Platt, T; Bouman, H;… Sensin… EARTH… English 
##  6 Lamane, H; Moussadek, R; Baghdad, B; Mouhir, L; Bria… Soil w… HELIY… English 
##  7 Rodriguez, A; Alejo-Reyes, A; Cuevas, E; Robles-Camp… Numeri… MATHE… English 
##  8 Cortesi, U; Ceccherini, S; Del Bianco, S; Gai, M; Ti… Advanc… ATMOS… English 
##  9 Buerki, S; Jose, S; Yadav, SR; Goldblatt, P; Manning… Contra… PLOS … English 
## 10 Nastula, J; Ponte, RM; Salstein, DA                   Compar… GEOPH… English 
## # … with 966 more rows
## 
## [[11]]
## # A tibble: 988 × 4
##    Authors                                               Article Source Language
##    <chr>                                                 <chr>   <chr>  <chr>   
##  1 Zhao, XY; Miao, CH                                    Spatia… INTER… English 
##  2 Pacheco-Torgal, F                                     Eco-ef… CONST… English 
##  3 Nasiyev, B; Gabdulov, M; Zhanatalapov, N; Makanova, G Study … RESEA… English 
##  4 Helmig, D; Petrenko, V; Martinerie, P; Witrant, E; R… Recons… ATMOS… English 
##  5 Voigt, C; Schumann, U; Graf, K                        CONTRA… PROGR… English 
##  6 Hartman, S; Ogilvie, AEJ; Ingimundarson, JH; Dugmore… Mediev… GLOBA… English 
##  7 Konstantiniuk, F; Krobath, M; Ecker, W; Tkadletz, M;… Influe… INTER… English 
##  8 Yu, TT; Leng, H; Yuan, Q; Jiang, CY                   Vulner… JOURN… English 
##  9 Karam, N; Khiat, A; Algergawy, A; Sattler, M; Weilan… Matchi… KNOWL… English 
## 10 Di Prima, S; Castellini, M; Pirastru, M; Keesstra, S  Soil W… WATER  English 
## # … with 978 more rows
## 
## [[12]]
## # A tibble: 981 × 4
##    Authors                                               Article Source Language
##    <chr>                                                 <chr>   <chr>  <chr>   
##  1 Shah, ARY; Shah, KS; Shah, CR; Shah, MA               State … RENEW… English 
##  2 Lindskog, M; Ridal, M; Thorsteinsson, S; Ning, T      Data a… ATMOS… English 
##  3 Morciano, M; Fasano, M; Boriskina, SV; Chiavazzo, E;… Solar … ENERG… English 
##  4 Lebo, ZJ; Morrison, H                                 A Nove… JOURN… English 
##  5 Wilczynska, D; Lysak-Radomska, A; Podczarska-Glowack… Effect… BMC S… English 
##  6 Ballard, S                                            Nonorg… FAR F… English 
##  7 Pinkovetskaia, I; Gromova, T; Nikitina, I             Produc… TARIH… English 
##  8 Xiong, Y; Zhang, JP; Yan, Y; Sun, SB; Xu, XY; Higuer… Effect… SUSTA… English 
##  9 Bui, DT; Hoang, ND; Pham, TD; Ngo, PTT; Hoa, PV; Min… A new … JOURN… English 
## 10 Alloza, JA; Vallejo, R                                Restor… DESER… English 
## # … with 971 more rows
## 
## [[13]]
## # A tibble: 676 × 4
##    Authors                                               Article Source Language
##    <chr>                                                 <chr>   <chr>  <chr>   
##  1 Marchese, C; de la Guardia, LC; Myers, PG; Belanger,… Region… ECOLO… English 
##  2 Blood, A; Starr, G; Escobedo, F; Chappelka, A; Staud… How Do… FORES… English 
##  3 Bonannella, C; Chirici, G; Travaglini, D; Pecchi, M;… Charac… FIRE-… English 
##  4 Lazarev, S; Kuiper, KF; Oms, O; Bukhsianidze, M; Vas… Five-f… GLOBA… English 
##  5 Gliss, J; Mortier, A; Schulz, M; Andrews, E; Balkans… AeroCo… ATMOS… English 
##  6 Mueller, CW; Gutsch, M; Kothieringer, K; Leifeld, J;… Bioava… SOIL … English 
##  7 Cohen, M; Quigley, K                                  Submar… AESTH… English 
##  8 Nyamekye, AB; Dewulf, A; Van Slobbe, E; Termeer, K    Inform… AFRIC… English 
##  9 Myriokefalitakis, S; Groger, M; Hieronymus, J; Dosch… An exp… OCEAN… English 
## 10 Viana, M; Hammingh, P; Colette, A; Querol, X; Degrae… Impact… ATMOS… English 
## # … with 666 more rows
# extract one of the list elements from "processed_wos_list"
processed_wos_list[[3]]
## # A tibble: 969 × 4
##    Authors                                               Article Source Language
##    <chr>                                                 <chr>   <chr>  <chr>   
##  1 Goswami, BB; Khouider, B; Phani, R; Mukhopadhyay, P;… Implem… JOURN… English 
##  2 Berman, AL; Silvestri, GE; Tonello, MS                On the… QUATE… English 
##  3 Kulmala, M; Asmi, A; Lappalainen, HK; Carslaw, KS; P… Introd… ATMOS… English 
##  4 Touchan, R; Anchukaitis, KJ; Meko, DM; Sabir, M; Att… Spatio… CLIMA… English 
##  5 Zhang, L; Han, WQ; Hu, ZZ                             Interb… JOURN… English 
##  6 Fallah, A; Sungmin, O; Orth, R                        Climat… HYDRO… English 
##  7 Wagner, TJW; Eisenman, I                              How cl… GEOPH… English 
##  8 Bloschl, G; Ardoin-Bardin, S; Bonell, M; Dorninger, … UNESCO… CLIMA… English 
##  9 Hoang, T; Pulliat, G                                  Green … URBAN… English 
## 10 Sanz, T; Rodriguez-Labajos, B                         Does a… GEOFO… English 
## # … with 959 more rows
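
Printing a long list produces a lot of console output, so a compact summary can be a useful complement. The following is a minimal sketch that assumes the purrr package is loaded; "toy_list" is a small, made-up stand-in for "processed_wos_list" and is not part of the workshop data.

```r
library(purrr)

# made-up stand-in for "processed_wos_list" (hypothetical data)
toy_list <- list(
  data.frame(Authors = c("A", "B", "C"), Language = "English"),
  data.frame(Authors = c("D", "E"), Language = "English")
)

# number of rows in each list element, returned as an integer vector
row_counts <- map_int(toy_list, nrow)
row_counts

# double brackets return the data frame itself
class(toy_list[[2]])
# single brackets return a list of length one that contains the data frame
class(toy_list[2])
```

Running the same map_int call on "processed_wos_list" should return one row count per imported file.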

4.5 Data Transfer Part 2: Exporting Data

4.5.1 Exporting a data frame

4.5.2 Exporting Multiple data frames

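# Removes the ".csv" suffix from the strings in "wos_files" and then assigns the modified character vector to a new object named "base_names"; the pattern "\\.csv$" escapes the dot (which would otherwise match any character) and anchors the match to the end of each string
base_names<-str_remove(wos_files, "\\.csv$")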
# appends the suffix "_modified.csv" to the strings in the "base_names" character vector, and assigns the resulting character vector to a new object named "processed_wos_list_names"
processed_wos_list_names<-paste0(base_names, "_modified.csv")
# prints contents of "processed_wos_list_names"
processed_wos_list_names
##  [1] "ClimateAndArt_01_modified.csv" "ClimateAndArt_02_modified.csv"
##  [3] "ClimateAndArt_03_modified.csv" "ClimateAndArt_04_modified.csv"
##  [5] "ClimateAndArt_05_modified.csv" "ClimateAndArt_06_modified.csv"
##  [7] "ClimateAndArt_07_modified.csv" "ClimateAndArt_08_modified.csv"
##  [9] "ClimateAndArt_09_modified.csv" "ClimateAndArt_10_modified.csv"
## [11] "ClimateAndArt_11_modified.csv" "ClimateAndArt_12_modified.csv"
## [13] "ClimateAndArt_13_modified.csv"
# uses the walk2 function to iteratively apply the "write_csv" function, using the data frames in "processed_wos_list" and the file names in "processed_wos_list_names" as arguments; the files are written out to the working directory
walk2(processed_wos_list, processed_wos_list_names, write_csv)

The code above works through the two inputs in parallel. It takes the first data frame in processed_wos_list and the first file name in processed_wos_list_names and passes them as arguments to the write_csv function, which writes that data frame to the working directory as a CSV file named after the first string in processed_wos_list_names (“ClimateAndArt_01_modified.csv”). It then takes the second data frame and the second file name and passes them to write_csv in the same way, which produces “ClimateAndArt_02_modified.csv”, and so on for the remaining data frames and file names. Because walk2 is called for its side effects (here, writing files) rather than for a return value, nothing is printed to the console.
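
To see the pairing more concretely, here is a small sketch that can be run on its own. It assumes the purrr and readr packages are installed; "toy_list" and "toy_paths" are made-up stand-ins for "processed_wos_list" and "processed_wos_list_names", and the files are written to a temporary directory rather than the working directory.

```r
library(purrr)
library(readr)

# two small, made-up data frames standing in for "processed_wos_list"
toy_list <- list(
  data.frame(Authors = c("A", "B"), Language = c("English", "English")),
  data.frame(Authors = "C", Language = "English")
)

# hypothetical output paths in a temporary directory
toy_paths <- file.path(tempdir(), c("toy_01_modified.csv", "toy_02_modified.csv"))

# walk2() pairs the i-th data frame with the i-th path and calls write_csv(data frame, path)
walk2(toy_list, toy_paths, write_csv)

# the same pairing written out as an explicit loop, for comparison
for (i in seq_along(toy_list)) {
  write_csv(toy_list[[i]], toy_paths[[i]])
}

# confirm that both files now exist
file.exists(toy_paths)
```

The loop and the walk2 call write the same files; walk2 simply expresses the pairing in a single line.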

Check your working directory to make sure that all 13 of the modified files have been written out to disk.
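
This check can also be done from within R. The sketch below uses the base R function file.exists; "expected_files" is a hypothetical vector that rebuilds the 13 file names shown above, and in practice "processed_wos_list_names" could be used in its place.

```r
# hypothetical stand-in for "processed_wos_list_names"
expected_files <- sprintf("ClimateAndArt_%02d_modified.csv", 1:13)

# TRUE for each file that is present in the working directory, FALSE otherwise
found <- file.exists(expected_files)

# names of any files that are missing (an empty character vector if all were written)
expected_files[!found]
```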

4.5.3 Exporting other objects (e.g. visualizations)

5 R Tools for Reproducibility and Communication